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Chapter  1 


Introduction 


Perhaps  one  of  the  main  impediments  to  rapid  progress  in  the  development  of  the 
social,  behavioral  and  biological  sciences  is  the  omnipresence  of  qualitative  data.  All 
too  often  it  is  simply  impossible  to  obtain  numerical  data;  the  researcher  has  the  choice 
of  qucditative  data  or  no  data  at  all,  [31] 

This  apology  (  or  excuse  )  indicates  the  need  of  analyzing  categorical  data.  A  specific  case  in 
which  categorical  data  makes  an  appearance  is  a  contingency  table.  Although  the  literature 
is  replete  with  methodologies  for  the  analysis  of  small  size  contingency  tables,  relatively  few 
methods  have  been  proposed  for  the  case  when  there  are  many  cells  of  the  tables  and  many 
missing  or  zero  values.  The  intent  of  this  thesis  is  to  indicate  that  non-linear  regression 
with  categorical  predictors  yields  a  reasonable  methodology  for  analyzing  such  large  sparse 
tables. 

The  first  section  of  this  chapter  introduces  the  concepts  of  categorical  variables  and 
of  contingency  tables.  It  is  included  purely  for  the  sake  of  completeness  and  is  intended 
to  neither  enlighten  nor  stimulate  the  knowledgeable  reader.  The  second  section  offers 
some  history  of  the  methodology  used  to  study  contingency  tables,  and  indicates  possible 
shortcomings  of  these  procedures  when  applied  to  large  sparse  tables. 


1.1.  Basics. 


A  categorical  variable  is  a  measurable  mapping  from  some  measure  space  to  a  finite 
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set.  The  elements  of  the  range  are  called  categories.  Classifications  such  as  sex,  occupation, 
race,  and  marital  status  immediately  present  themselves  as  examples  of  categorical  variables. 
Male,  female,  sometimes,  and  never  are  examples  of  specific  categories  of  the  classification 
“sex”.  Any  random  variable  can  be  transformed  into  a  categorical  variable.  Let  X  be 
a  random  variable,  a  partition  of  the  range  of  X,  and  {o£y}y_j  a  set.  Then  the 

mapping 

C{X)  =  ai  if  Xe  Pi  (1.1.1) 

is  categorical. 

Prom  this  construct  it  is  clear  that  all  data  may  be  considered  categorical,  since  any 
measurement  is  inherently  of  finite  precision  and  consequently  can  only  be  reported  as  lying 
in  a  certain  interval.  Although  this  view  may  seem  a  bit  extreme  and  counter-productive, 
it  is  not  without  its  proponents  [31].  The  approach  taken  here  is  to  consider  any  map¬ 
ping  derived  by  (1.1.1)  to  be  categorical  only  if  the  number  of  categories  is  “reasonably 
small”.  What  constitutes  “reasonably  small”  is  left  xmdefined,  it  being  assumed  the  reader 
is  sufficiently  intelligent  to  interpret  it  in  a  “reasonable”  manner. 

Let  5  be  an  arbitrary  set,  and  be  mappings  (7*:  S  — ►  where  Fm*  is  some 

set  of  nik  distinct  symbols.  The  relation  ~  may  be  defined  on  S  by  si  «2  Cj{8i)  = 
Cj{s2)^j-  It  is  easy  to  show  that  is  an  equivalence  relationship.  A  mi  x  m2  x  •  •  •  x  m^ 
contingency  table  is  defined  to  be  any  function  g  which  is  constant  on  these  equivalent 
classes.  It  is  noted  that  the  functions  {Cy}  are  categorical  variables.  Consequently  a 
contingency  table  may  also  be  thought  of  as  a  function  in  which  all  points  having  the  same 
categorical  predictors  are  mapped  into  a  single  value.  Such  tables  arise  frequently  from 
the  cross-classification  of  a  population  according  to  several  characteristics.  In  this  instance 
the  set  5  would  be  the  set  of  individuals  in  the  population,  with  Ck  corresponding  to 
the  characteristic  of  the  cross-classification,  the  distinct  instances  of  the 

characteristic,  and  i^(*)  being  either  the  number  of  individuals  in  the  population  having 
the  specified  traits  or  the  proportion  of  individuals  in  the  population  having  -^He  specified 
traits.  The  elements  of  the  range  in  the  former  case  are  known  as  counts  and  in  the  latter 
proportions. 
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It  is  possible  to  consider  this  a  particular  instance  of  a  regression  model  where  the 
existence  of  a  mapping  /:5  — ►  SC  is  assumed  and  the  random  variables  observed  are  the 
(n -f-  1)  “  tuples  /(«)).  This  differs  from  the  contingency  table  in  that 

it  is  not  assumed  that  C'y(si)  =  Cj{s2)^j  implies  f{si)  =  f{s2)*  Thus,  points  having 
the  same  categorical  predictors  are  allowed  to  have  different  responses.  This  type  of  data 
commonly  arises  when  a  population  is  cross*classified  according  to  a  set  of  characteristics 
after  which  some  type  of  experiment  is  performed  and  a  continuous  response  is  observed 
for  each  member.  It  should  be  noted  that  in  this  set-up  a  contingency  table  still  may  be 
appropriate  if  the  response  is  discrete,  the  peirticular  response  may  be  considered  another 
cleissification  of  the  population. 

Given  a  mi  x  m2  x  •  •  •  m„  contingency  table  G  it  is  possible  to  construct  a  mi  x  m2  x 
•  •  •  X  m„  table  whose  (ti,t2»  •  •  entry  is  G(ii,i2,  •  •  -,*«)-  Any  entry  not  in  the  range 

of  G  is  called  a  missing  value  and  any  entry  which  is  zero  will  be  called  a  zero  entry.  These 
entries  arise  from  several  different  causes.  Two  of  the  most  common  ones  are: 

1.  Because  of  the  finiteness  of  the  sample  size  an  entry  which  would  not  be  zero  if  an 
infinite  sample  size  was  drawn  is  zero.  These  are  known  as  zeros  due  to  sampling 
variation. 

2.  Due  to  the  classification  of  the  entries  certain  combinations  may  be  impossible  or 
redundant.  Such  zeroes  dxe  known  as  structural  zeroes. 

Zeroes  due  to  sampling  variation  occur  frequently  when  the  sample  size  and  the  total 
number  of  elements  in  the  table  are  of  the  same  order  of  magnitude,  and  is  a  common 
occurrence  when  the  population  is  classified  according  to  many  characteristics.  A  nation¬ 
wide  educational  survey  may  not  have  anyone  classified  as  a  white,  eastern  European  Jewish 
male  between  twenty-five  and  thirty  years  of  age  who  farms  in  the  midwest,  although  there 
do  exist  people  belonging  to  this  category. 

A  natural  example  of  structural  zeroes  may  be  found  in  genetics.  Certain -cdlele  combi¬ 
nations  are  known  to  be  fatal  and  thus  classifying  animals  according  to  these  genotypes  will 
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necessarily  yield  zeroes  counts  for  certain  combinations.  These  entries  are  known  a  priori 
to  be  zero  and  any  analysis  of  the  data  should  take  this  into  account. 

There  have  been  many  methods  of  analyzing  contingency  tables  which  have  no  zero 
cells,  and  many  of  these  methods  have  been  successfully  adopted  to  the  case  of  structural 
zeroes.  The  same  is  not  quite  true  for  the  case  of  sampling  zeroes,  and  it  is  this  case  which 
shall  be  of  primary  interest  in  this  work. 


1.2.  Brief  Overview. 

The  literature  of  contingency  tables  is  vast  and  scattered.  No  attempt  has  been  made 
to  be  thorough  and  the  topics  presented  are  more  likely  to  represent  the  author’s  preference 
rather  than  any  concept  of  “importance”. 

The  use  of  contingency  tables  dates  back  at  least  to  the  early  nineteenth  century  with 
the  work  of  Quetelet  [26].  The  actual  analysis  of  contingency  tables,  however,  is  considered 
to  have  begun  with  Pearson  [25],  who  first  proposed  the  classical  t^st.  Pearson  adopted 
the  view  that  categorical  variables  could  always  be  thought  of  as  a  discretization  of  some 
possible  unknown  continuous  random  variable  and  insisted  that  all  analysis  be  based  on  this 
assumption.  This  necessitated  some  logical  acrobatics  to  deal  with  apparently  dichotomous 
predictors.  Hence  Pearson  considered  live  vs.  dead  to  be  extreme  values  of  some  continuous 
scale  of  “health”  and  argued  that  employed  vs.  unemployed  were  a  discretization  of  some 
continuous  “amount  of  work”  variable. 

In  the  same  year  Yule  [32]  proposed  another  theory  for  contingency  tables.  His  view  was 
that  the  categorical  variables  were  fixed  and  did  not  arise  from  any  discretization  process. 
This  was  in  direct  opposition  to  Pearson’s  work,  leading  to  a  long  and  bitter  correspondence 
between  the  two.  For  a  long  period  of  time  during  and  following  this  debate  the  study  of 
contingency  tables  was  constrained  to  2  x  2  tables.  Progress  was  made  in  1935  when  Bartlett 
[3]  derived  a  definition  for  second  order  interactions  in  2  x  2  x  2  tables  . 
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To  facilitate  the  discussion,  the  following  notation  will  be  used.  Consider  a  table 
having  three  categorical  predictors,  A,B,C  with  categories  Ai^Bj^Cki  where  t  <  r,  j  <  «, 
and  k  <  t.  Let  T,*yjfc  =  0  Bj  fl  (7*),  and  let  pij^g  be  the  sample  estimates  of  The 

standard  summation  notation  is  assumed,  ie  p.y*  =  J2iPijki  Pv-  =  Sy,*P»7*> 

Bartlett’s  definition  of  no  second  order  interactions  can  then  be  written  as 


Pill  P221  _  Pi  12  P222 
Pl21  P211  Pl22  P212 


(1.2.1) 


His  method  of  testing  required  the  solution  of  a  cubic  equation.  The  definition  of  no 
interaction  in  a  2  x  2  x  3  table  was  also  defined,  although  this  required  the  solution  of  two 
simultaneous  degree  four  equations  in  two  variables. 

Since  then  much  research  has  arisen  in  the  study  of  contingency  tables  and  interactions. 
Most  of  the  later  work,  however,  can  be  traced  to  one  of  two  methods  presented  in  the  lOSO’s. 
The  first  method  was  introduced  by  Lancaster  [24]  in  1951,  who  developed  a  method  of 
partitioning  a  statistic  in  order  to  develop  a  concept  of  interaction  in  the  general  rxsxt 
table.  In  the  notation  above,  the  hypothesis  of  no  three-way  interaction  can  be  written  as 


H: 


Pijk 


Pi.,  p.y.  p..* 


P-jk  Pi-Ai  Pij' 
p.y.  p..k  Pi..  p..k  Pf.  p.y. 


(1.2.2) 


Letting 


• _ _  _  /  P'Jk  ,  Pf.fc  ,  Pij- 

Pijk  =  Pi-  Pi-  P-k{-——  + - +  - 


P  j-  P-k  Pi-  P-k  Pi-  P  i 


-2) 


(1.2.3) 


the  chi-square  statistic  for  the  hypothesis  (1.2.2)  can  be  written  as 
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y>  {Pijk  -  Pijk)^  _  {Pijk  -  Pi-  P  i-  P  -k)^  _  y->  {P  jk  -  P  y-  P  -k)^ 
^  Pi:  Pi-  P-k  jji  Pi:  P.y  P:k  ^  P./*  P -k 


(Pi  k  -  Pi-  P-t)^ 
Pi-  P-k 


{Piy  -Pi-  Pi  ? 
Pi-  P  y 


(1.2.4) 


=  x\bc  ~  Xbc  ~  Xac  ~  Xab 


Each  term  on  the  right  hand  side  of  (1.2.4)  corresponds  to  a  test  of  independence  of  two 
(or  more)  predictors.  For  example,  x\b  is  the  statistic  for  the  hypothesis  that  factors 
A  and  B  are  independent.  It  was  these  individual  terms  that  Lancaster  used  to  dehne  the 
interactions  between  the  predictors. 


The  second  approach  came  in  1956  with  the  work  of  Roy  and  Kastenbaum  [28].  Gen¬ 
eralizing  Bartlett’s  definition  of  independence  (1.2.1)  to  the  r  x  s  x  t  table  they  defined 
interaction  in  terms  of  the  (r  —  l)(s  —  l)(t  —  1)  ratios: 


Pr9t  Fiji  j  Pfh  Pr/I 
put  Prjt  /  P%$k  Prjk 


i<r-l 

j<a-l  (1.2.5) 

ife<t-l 


The  hypothesis  of  no  interaction  was  defined  as  all  of  the  terms  being  equal  to  one.  They 
tested  the  hypothesis  of  no  interaction  by  the  maximum  likelihood  method,  leading  to 
(r  —  1)(«  —  l)(t  —  1)  simultaneous  equations  of  degree  three.  The  standard  x*  formula  was 
then  used  as  a  test  criterion. 

One  important  method  derived  from  this  approach  which  was  initiated  by  Birch  [6]  and 
used  by  many  subsequent  authors  [5],  [7],  [22],  is  the  log-linear  model.  Following  Goodman’s 
[22]  notation,  the  probabilities  pijt  were  written  as 


log  Piik  =  0  +  Xf  +  \f  +  X^  +  -h  X%^  + 


(1.2.6) 
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with  the  ANOVA-like  constraints 


=  A®  =  Ap  =  0 

A^-^  =  =  Xf^  =  A^^^  =  Xfp  =  A^'^  =  0 

\ABC  _  \ABC  _  \ABC  _  \ ABC  _  \AjBC  _  \ABC  _  n 
—  ^i  k  —  ^iy-  —  ^  y  “  ^-k  “  ^ 


for  all  t,  y,  i  (1-2.7) 


The  A’s  represented  the  possible  effects  of  the  three  variables.  The  main  effects  were  A^,  A^, 
and  A^,  the  first  order  interactions  were  represented  by  A^^,  A*^^,  and  A-®^,  while 
represented  the  second  order  interaction.  The  model  had  the  advantage  of  having  simple 
methods  for  estimating  these  effects,  allowing  straight  forward  calculations  for  testing  of 
the  presence  of  given  interactions  and  having  a  obvious  extension  to  arbitrarily  large  and 
complex  tables. 


Assuming  the  elements  in  the  table  represent  counts,  difficulties  arise  in  these  proce¬ 
dures  when  some  of  the  entries  are  zero.  The  behavior  of  Lancaster’s  statistic  is  not  well 
understood  nor  is  well-behaved  and  consequently  looses  much  of  its  appeal  in  these  cases. 
More  drastic  is  the  behavior  of  the  log-linear  model,  log  0  =  —oo,  creating  rather  unpleas¬ 
ant  consequences  in  fitting  model  (1.2.6).  Eliminating  those  categories  which  contain  zero 
entries  looses  information  and  is  unreasonable  when  the  number  of  such  entries  is  of  even 
moderate  size. 

There  has  been  two  main  lines  of  attack  to  salvage  the  above  mentioned  theories.  The 
first  and  to  some  extent  less  successful  approaches  have  dealt  with  sampling  zeroes.  One 
approach  considered  was  to  eliminate  empty  cells  or  cells  having  few  entries  by  collapsing 
adjacent  rows  of  the  table.  Craig  [12]  demonstrated  a  methodology  of  collapsing  two  ad¬ 
jacent  rows  which  yielded  consistent  estimates  of  the  corresponding  cell  means  and  a  x  ~ 
square  statistic  which  converged  asymptotically  to  a  x^  distribution.  Bishop  [8]  approached 
the  problem  from  the  log-linear  model  and  examined  conditions  under  which  collapsing  the 
table  by  adding  over  a  variable  would  not  affect  any  multi-factor  effects.  However,  for  large 
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tables  with  many  missing  cells,  the  method  of  collapsing  of  a  table  to  eliminate  all  missing 
entries  may  not  be  feasible  since  it  may  reduce  the  table  to  a  size  much  smaller  than  desired. 
Consequently,  this  method  works  best  only  when  the  number  of  zero  cells  is  a  small  fraction 
of  the  total  number  of  cells. 

Another  approach  has  been  to  replace  the  zero  entries  with  some  positive  number 
and  to  carry  out  the  analysis  with  the  altered  table.  Probably  the  simplest  method  along 
these  lines  is  due  to  Berkson  [4]  who  suggested  replacing  0  with  the  value  1/2.  Variants  of 
this  procedure  have  arisen  in  the  literature,  including  suggestions  of  adding  either  one  or 
one  half  to  all  cells  in  order  to  eliminate  zero  entries.  A  more  sophisticated  methodology 
would  be  to  replace  the  zeroes  in  the  table  with  the  expected  values  of  the  cells  under  some 
model.  Deming  and  Stephan  [13]  introduced  the  iterative  proportional  fitting  procedure  for 
fitting  the  table  with  the  maximum  likelihood  estimates  of  the  expected  cell  frequency.  The 
procedure  has  the  advantage  over  Berkson’s  suggestion  of  being  less  arbitrary,  but  relies 
very  much  on  the  parametric  model  assumed.  Both  methods  may  be  attacked  on  a  general 
philosophical  ground  that  analyzing  data  that  has  been  augmented  by  addition  of  values 
of  either  an  arbitrary  nature  or  by  assumption  of  a  specific  parametric  model  should  be 
avoided  whenever  possible. 

There  have  been  several  successful  strategies  proposed  for  structural  zeroes.  By  and 
large  these  have  been  the  classical  techniques  with  minor  modifications  to  account  for  the 
zeroes.  To  simplify  the  discussion  the  remainder  of  the  chapter  will  consider  only  R  x  C 
tables. 

One  type  of  table  with  structural  zeroes  which  occurred  early  in  the  history  of  contin¬ 
gency  tables  was  separable  tables  [21].  A  table  having  unordered  categories  and  structural 
zeroes  is  said  to  be  separable  if  it  is  possible  to  reorder  the  categories  of  the  predictors  in 
order  to  obtain  a  table  having  block  diagonal  form;  that  is  if  there  exists  a  partition  of 
the  categories  of  the  first  predictor  and  a  partition  of  the  categories  of  the  second 

predictor  Xf,  =  0  if  r  G  A,-,  a  €  By,  with  t  5^  j.  In  such  cases 

the  analysis  reduces  to  the  analysis  of  n  contingency  sub-tables  which  are  asymptotically 
independent.  Standard  techniques  then  can  be  applied  to  the  individual  tables  [21]. 


Section  1.2:  Brief  Overview  9 


A  second  approach,  applied  to  non*separable  tables,  extending  the  log-linear  model 
(1.2.6)  is  the  concept  of  quasi-independence,  due  to  Goodman  [21].  Let  S  =  {{i,j)  \  JT.y  #  0}. 
The  probabilities  are  written 


log  3r<y  =  -f- -h  for(*,y)€5  (1-2.9) 

with  the  constraints 


E  V=  E 

(»J)€S  («J)€5 


(1.2.10) 


The  model  of  quasi-independence  in  this  case  can  be  expressed  as  =  0  for  (t,  j)  £  5. 
Methods  corresponding  to  the  analysis  of  tables  lacking  structiiral  zeroes  can  then  be  applied 
(see  [14]  for  example). 

Recently  several  methods  have  appeared  in  the  literature  to  estimate  the  means  of 
large  contingency  tables  which  have  many  empty  cells  due  to  sampling  variations.  These 
methods  are  based  on  the  assumption  that  the  categories  of  the  table  are  ordered  so  that 
there  is  a  smooth  transition  of  cell  probabilities  as  one  proceeds  along  any  row  or  column. 
Fienberg  and  Holland  [16]  analyzed  sparse  tables  in  a  Bayesian  framework  and  considered 
estimators  based  on  a  Dirichlet  prior  and  a  squared  -  error  loss  function.  The  smoothness 
condition  was  expressed  by  having  the  cell  means  satisfy  a  functional  equation  involving  a 
function  of  the  row  and  column  number. 

More  recently,  Simonoff  [29]  has  adapted  the  maximum  penalized  likelihood  methodol¬ 
ogy  of  density  estimation  to  the  estimation  of  cell  means  in  large  sparse  contingency  tables. 
Assuming  a  multinomial  likelihood  for  the  cells,  the  estimates  of  the  cell  means  are  those 
values  which  maximize  the  multinomial  likelihood  minus  some  penalty  which  measiues  the 
^roughness’’  of  the  estimates.  Unfortunately,  implementation  of  the  procedure  to  large 
tables  is  difficult  and  the  behavior  not  well  understood  in  these  cases. 
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One  final  methodology  considered  which  can  be  used  to  relate  the  behavior  of  neigh¬ 
boring  cells  to  each  other  is  the  method  of  scoring.  This  methodology  does  not  appear 
often  in  the  statistics  literature  but  is  not  uncommon  in  other  fields.  Fisher  [17]  is  usually 
credited  with  the  first  use  of  scoring.  The  approach  is  to  assign  some  values  to  the  cate¬ 
gorical  predictors  in  order  to  maximize  the  correlation  between  the  predictors.  A  detailed 
description  can  be  found  in  Kendall  and  Stuart  [23].  The  approach  followed  by  the  psy¬ 
chometricians  is  different.  The  philosophy  of  theirs  is  closely  related  to  Pearson’s  ideas  of 
categorical  variables  in  contingency  tables.  The  purpose  of  optimal  scoring,  as  it  is  called  in 
the  literature,  is  to  assign  to  the  categories  of  the  predictors  a  value  corresponding  to  some 
unseen  metric.  These  values,  called  scores,  relate  the  discrete  variables  to  some  continuous 
measurement.  Once  assigned,  the  scored  variables  are  treated  as  continuous  random  vari¬ 
ables  and  standard  techniques  can  be  applied  to  them.  The  term  ^optimal”  comes  from  the 
fact  that  the  scores  are  assigned  as  to  optimize  the  fit  between  the  scored  variables  and  the 
model  fitted.  A  laborious  and  detailed  theory  of  the  above  may  be  found  in  [31]. 


Chapter  2 


The  Algorithm 


Given  a  mi  x  m2  x  •••mn  contingency  table  with  as  the  (ti,t2, . . . 

entry,  a  relatively  flexible  model  which  may  be  hypothesized  is 

n 

i=l 

where  Sy  are  scores  and  ^(•)  is  a  ^smooth”  function.  Such  a  model  can  be  fitted  by  adapted 
version  of  a  general  procedure  called  P.ACE,  which  is  itself  an  adaptation  of  another 
procedure  called  ACE. 

The  first  section  of  this  chapter  briefly  describes  the  P.ACE  algorithm,  as  well  as 
its  predecessor  ACE.  The  second  section  describes  in  detail  the  model  and  motivates  the 
algorithm  used. 


2.1.  ACE  and  race. 

Alfred  R6nyi  [27],  in  1959  considered  the  problem  of  defining  the  dependence  of  two 
random  variables  (X,y).  Consideration  of  certain  natural  conditions  which  he  felt  such 
a  measure  should  possess  led  to  the  concept  of  maximal  correlation  between  to  be 

the  definition  he  chose.  The  maximal  correlation  between  two  random  variables  ri  and  u  is 
defined  as 
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S(»7,i/)  =  supCor(/(r/),i^(i/)), 
f,g 

where  the  sup  is  taken  over  all  functions  for  which  the  left  hand  side  is  defined. 

In  the  paper  necessary  conditions  for  the  existence  of  two  functions  /  and  g  such  that 
S(»7,  i/)  =  Cor(/(»7),y(i/))  were  proved.  Such  functions  can  be  thought  of  expressing  the 
natiural  scale  when  considering  the  relationship  between  (X,  F)  since  f(X)  and  g{Y)  are  in 
some  sense  as  similar  as  possible.  No  attempt  was  made,  however,  to  establish  a  method 
for  determining  /  and  g. 

In  1982  Breiman  and  Friedman  [10]  considered  the  following  generalization  of  of  this 
problem: 

Given  Jfi, . . . , Xp,y  S,P)  determine  and  t?  which  minimize 

E(i?(y)  —  respect  to  all  square  integrable  functionals  subject  to 

the  constraint  E(i?^(y))  =  1. 

They  were  able  to  prove  existence  and  uniqueness  of  such  functions  and  developed  a 
procedure  of  determining  these  functions.  Their  procedure,  christened  ACE  for  Alterna¬ 
ting  Conditional  Expectation,  is  an  iterative  procedure  yielding  a  sequence  of  functions 
. . . ,  which  converge  to  the  solution  under  some  regularity  conditions. 

A  related  problem  may  be  posed:  Given  F,  Xxj  .,.,Xp  as  before^  determine  $,  ^i,. ,  .,^p 
which  minimize 


(2.1.1) 


Such  a  problem  may  occur  when  the  F  variable  is  thought  of  as  a  response  and  the  vector 
of  X  variables  are  thought  of  as  predictors,  and  it  is  desired  to  predict  the  value  of  F 
from  the  X^s,  The  ACE  algorithm  yields  a  prediction  rule  for  the  transformed  F,  which 
may  not  be  optimal  for  predicting  the  untransformed  response.  The  naive  approach  to 
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this  problem  is  to  apply  the  ACE  algorithm  to  determine  i?,  and  which  minimize 

E  (t?(y)  —  define  $  =  and  =  v?,-  for  t  =  For  noninvertible 

t?  this  methodology  is  of  little  value.  Further,  given  the  existence  of  this  approach  in 
general  does  not  yield  the  desired  result.  If  and  minimize  both 


and  E(y-<?(^^.(X.))y  (2.1.2) 


then 


r^(E(y  I  =  Eir^Y)  |  E W^.)).  (2.1.3) 

1=1  «=i 

This  is  in  general  not  true.  In  particular  if  ff  is  either  non-linear  concave  or  convex  Jensen’s 
inequality  prevents  (2.1.3)  from  occurring. 

A  solution  to  this  prediction  problem,  known  as  P.ACE  for  Predictive  ACE,  was 
proposed  by  Friedman  and  Owen  [19].  As  with  ACE  it  is  an  iterative  procedure  and  is 
outlined  below. 


Set  = 


=  id; 


Set  t  =  1 . 

Iterate  until  convergence  oi  0^*^,  '“4>p^ 

^(‘+i)(y)  is  a  “smooth"  ol  Y  on 

=  those  functions  which  minimize  E^y  — 


i  i-»  i  -I-  1 


End  outer  loop  iteration 


The  second  step  of  the  inner-loop  is  implemented  by  a  procedure  which  alternately 
updates  each  ^  while  fixing  the  other  functions,  and  continues  \mtil  each  ^  function 
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converges.  A  special  case  of  this  method  will  be  used  to  analyze  contingency  tables  and  is 
discussed  in  detail  in  the  next  section. 


2.2.  Notation  and  Description  of  Algorithm. 

The  notation  for  the  remainder  of  the  work  is  now  defined.  Let  {Y,Xi,...,Xp)  = 
(y,X)  be  random  variables  on  some  probability  space  Let  Xi  be  categorical 

variables,  with  X,-:  fl  {1, 2, , n,}  where  n,-  <  oo  for  for  *  =  1, . . .  ,p.  Associated  with 
each  categorical  variable  is  a  scoring  with  the  constraint  that  the  sum  of  the  scores  be  equal 
to  zero  .  Let  py  =  {/:N  IR  |  /(*)  =  0}*  Then  every  S  €  p„..  is  a  scoring  for  Xf. 

An  additional  constraint  may  be  imposed  if  Xi  is  an  ordered  categorical  variable.  If  -<  is 

the  order  relation  among;st  the  categories  {1, _ ,  n,-}  of  Xi  then  the  scores  S  should  satisfy 

S(i)  <  S(j)  for  i  -<  j. 

No  assumptions  are  made  concerning  the  joint  distribution  of  X.  The  model  postulated 
is 

E(Y\Xi  =  xi,...,Xp  =  xp)  =  f^Ojif^siiXi)),  (2.2.1) 

/=i  .=1 

where  m  is  finite  and  Sy  €  Pn,-  ^y(0  are  elements  from  a  class  of  functions  which  are 
constrained  only  by  a  vague  concept  of  “smoothness” .  Some  scaling  of  the  scopes  is  required 
to  insure  that  the  scores  and  the  function  are  well-defined.  The  scaling  used  by  the  author 

isllES.(JC.-)||oo  =  l. 

Two  particular  instances  of  this  model  are  now  presented.  The  first  example  is  the 
model 


=  E'<(Esj(x)). 


y=i  .=1 


(2.2.I0) 
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In  this  case  no  error  is  assumed  in  the  formulation  of  the  model  and  any  deviation  of 
the  data  from  the  model  is  due  exclusively  to  sampling  errors.  Such  a  model  arises  when 

=  tp)  and  the  elements  of  the  table  are  the  proportions  of 
counts  falling  in  that  position.  In  this  particulcir  case  the  constraints  >  0  and 

E, =  1  are  imposed. 

On  the  other  hand,  if  the  Yu,...  are  rates  corresponding  to  those  elements  of  the 
population  having  characteristics  (Xi  =  ti,...  ,Xp  =  Sp),  then  a  more  appropriate  model 
might  be 

m  p 

i=l  i=l 

where  it  is  assumed  that  the  e  is  independent  of  (y,X),  with  E(t)  =  0.  Although  the 
previous  model  could  be  thought  of  as  a  special  case  of  this  with  e  representing  the  sampling 
error,  this  is  not  the  intent.  In  theory,  data  collected  from  model  (2.2.1b)  could  have  several 
observations  corresponding  to  the  same  predictor  variables  while  in  model  (2.2.1a)  at  most 
one  observations  could  ever  arise  for  a  particular  set  of  predictor  variables. 

A  particular  instance  of  the  random  variables  {Y,  X)  is  a  finite  set  of  points  in  5R  x 
denoted  by  (Yi,  =  {Yi,Xi^i, . .  •  >Xp,i)"_j.  It  is  desired  to  fit  to  this  instance  a  model 

of  the  form  (2.2.1a)  or  (2.2.1b).  For  the  time  being  it  will  be  assumed  that  m  =  1  so  that 
the  outer  sum  contains  exactly  one  summand.  The  model  is  then  fitted  by  choosing  scores 
Si  and  a  function  9  which  minimize 

>■))).  (2.2.2) 

t=i  '  y=i  ' 

A  variant  of  the  P.ACE  algorithm  mentioned  in  the  previous  section  is  used  to  fit 
the  model.  The  procedure  iteratively  updates  the  function  tf(-)  and  the  scores  Sy  until 
convergence.  For  the  scores  fixed  9(-)  is  determined  by  “smoothing”  the  data  Yi  on  the  sum 
of  the  scores  Sy(Xy).  Readers  unfamiliar  with  smoothing  procedures  will  find  a  short 
discussion  in  the  appendix. 
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For  ^(*)  fixed,  determining  the  scores  Sy  is  a  large  non-linear  minimization  problem. 
A  direct  attack  is  expensive  and  consequently  a  simple  “approximate”  solution  is  chosen. 
It  is  sufficient  to  consider  optimizing  Si(Xi).  Let  7,-  =  {j  :  XiJ  =  i}.  Then 


EfW  -  =  EE{«  -  "(ES/Wv))}’-  (2-2-3) 

t'  /. 

Consequently,  to  minimize  the  left  hand  side  it  is  sufficient  to  minimize  each  summand  on 
the  right  hand  side.  Prom  symmetry  it  is  evident  that  only  one  term  ,  which  for  definiteness 
sake  is  taken  to  be  7i,  is  needed  to  be  analyzed  in  detail.  There  are  two  approaches  which 
will  yield  the  desired  result.  The  first  method  is  a  simple  application  of  calculus  and  is  left 
as  an  exercise  for  the  reader.  The  second  method  is  more  convoluted  and  obscure.  It  is 
the  second  method  which  will  be  illustrated,  not  because  of  any  perversity  of  the  author 
but  rather  because  it  illustrates  the  original  P.ACE  procedure,  and  yields  a  result  that  is 
easily  interpreted. 

Let  Sy  be  some  set  of  scores,  Wi  =  ^  Sy(Xi j),  Sj  be  the  value  of  the  optimal  score  for 
the  first  predictor  variable  at  the  point  1,  and  S  —  S\  —  Si(l).  The  problem  thus  reduces 
to  finding  the  value  of  S  which  minimizes 

+  (2.2.4) 

/l 

Assume  ${•)  €  C*  so  that  ${v)i  +  5)  =  +  6  +  0(S^).  For  each  »  let  to  be 

that  value  which  minimizes  (Yf  —  ${v>{  +  Then  (Yi  —  +  S-))^  is  minimized 

over  all  sets  of  numbers  and  all  that  remains  is  to  select  a  single  value  for  S.  To 

simplify  notation  let  ei  =  Y  —  ${wi  +  S-),  and  m,-  =  +  S^). 


(Y,-  -  6(wi  +  (Yi  —  ff(w{  +  Si)  +  $(wi  +  Sf)  —  0(wi  + 

/i  It 


=  (e.-  +  Hv>i  +  S,*)  -  +  s:)  +  (S-  S-)  m,-  +  0(S  -  S:)]f 

It 
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= + 2  - «;) + E(«  -  «/)"”■? + -  «;n 

h  /i  h 

/l  /l  fl 

/i  /i 


(а) 

(б) 


Equation  (a)  results  from  a  flagrant  abuse  of  O(-)  notation,  whereas  equation  (b)  is  a  valid 
consequence  of  the  fact  that  e,m,-  =  0  since  S/  minimizes  ((¥{  —  +  It  is  clear  that 

to  minimize  this  it  is  necessary  and  sufficient  that  0  =  ~  ^i)  >  or  equivalently 


(2.2.5) 


It  is  noted  that  the  optimal  “global”  increment  is  a  weighted  average  of  the  optim^ll 
“local”  increments.  For  mf  small,  a  small  deviation  from  5?  and  consequently  a  small  chsmge 
in  the  function  tf(*)  results  in  a  small  increase  of  the  squared  error.  Conversely  for  m?  large, 
a  small  deviation  from  results  in  a  large  increase  of  the  squared  error.  Consequently,  a 
rational  procedure  would  down-weight  those  values  for  which  mf  is  small  and  give  greater 
weight  to  those  values  for  which  m?  is  large.  The  updating  algorithm  (2.2.5)  does  precisely 
this. 


Apart  from  ease  of  interpretation  this  solution  is  less  than  optimal  for  updating  the 
scores.  The  determination  of  is  time  consuming,  appearing  directly  in  the  numerator 
and  indirectly  in  the  denominator  of  (2.2.5).  To  simplify  even  further,  the  ubiquitous 
Taylor’s  theorem  is  invoked  twice.  Without  loss  of  generality  it  may  be  assumed  that 
m,-  0.  Hence  0  =  I’i  —  tf(t»,-  -i-  S^)  =  Y,-  —  {fi(wi)  -1-  Sf  0'{wi)  -I-  0([#?]*)},  which  with 

mi  =  r(u>i  +  =  6'{wi)  -h  0{Sf)  yields  Yi  —  0(tOi)  =  mi  6-  +  0([tf,*]^).  Substituting  these 

two  expressions  into  (2.2.5)  and  after  some  sleight  of  hand  involving  the  further  misuse  of 
O(-)  one  arrives  at 
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s!  =  s.(l)  +  -  ‘'(ESy(X,v))]  «'(ES/(Xo)) /  E[«'(Es,(J:„))]"  (2.2.6) 

/i  y=i  y=i  '  /i  y=i 

Despite  its  appearance,  this  updating  algorithm  is  easily  implemented.  The  term 
Yi  —  0  (l^y=ri  Sy(-^«\y))  merely  the  residual  of  the  model  from  the  observation  and  can 
computed  when  calculating  ^(•).  As  explained  in  the  appendix  0^  likewise 

can  be  estimated  at  this  time.  The  difference  between  the  solution  (2.2.6)  and  the  solution 
of  (2.2.5)  is  of  order  0{S  -  6*)  which  is  small  when  the  algorithm  is  close  to  the  true 
minimizing  functions. 

No  order  constraints  for  the  scores  corresponding  to  ordered  categorical  variables  have 
yet  been  imposed.  H  alternately  smoothing  and  applying  the  updating  algorithm  (2.2.6) 
yields  as  the  result  scores  with  the  correct  order,  then  the  constrained  problem  coincides 
with  the  unconstrained  problem  and  no  difficulty  arises.  If,  on  the  other  hand,  this  is  not 
the  case  the  offending  scores  must  be  coerced  into  submission.  This  is  accomplished  by  use 
of  the  pool  adjacent  violators  algorithm.  The  method  is  discussed  in  detail  in  [2].  Since  the 
order  of  the  scores  the  procedure  finds  yields  information  about  the  data  this  approach  of 
forcing  a  particular  ordering  of  the  scores  will  not  be  considered  here. 

The  algorithm  is  then  continued  with  the  new  scores  and  categories  until  convergence. 
The  procedure  of  pooling  the  scores  for  ordered  variables  if  necessary  and  continuing  the 
algorithm  continues  until  the  procedure  converges  with  the  scores  of  ordered  categories 
obeying  the  order  constraints. 

Thus  a  method  of  determining  scores  and  a  function  to  minimize  (2.2.2)  is  determined 
and  all  that  remains  is  to  extend  this  method  for  determining  functions  and  scores  to 
minimize 


E(n-E',(Es.“».)))'. 

'  y=i  k~i  ' 


(2.2.7) 


where  N  is  some  small  number. 
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To  this  end  a  greedy  algorithm  is  employed  in  conjunction  with  the  procedure  outline 
above.  Let  . . . ,  be  the  function  and  scores  obtained  by  minimizing  (2.2.2). 

Define  a  new  set  of  response  variables  Yi  =  Yi  —  h  yfoTi  —  l,...,n. 

The  second  set  of  functions  and  scores  are  then  determined  according  to  the  methodology 
described  above  using  the  residuals  from  the  first  fit  as  the  new  response  variable.  This 
procedure  of  replacing  the  response  variable  with  the  residuals  from  the  previous  fit  and 
finding  the  minimizing  function  and  scores  for  (2.2.2)  yields  the  sequence  of  functions  and 
scores,  and  is  repeated  until  it  is  deemed  that  no  acceptable  improvement  in  the  model  is 
obtained.  In  practice,  however,  it  has  been  found  in  the  examples  tried  so  far  that  one 
summand  is  sufficient  to  capture  most  of  the  structure  in  the  data. 

Further  modifications  can  be  made  to  the  algorithm.  The  most  obvious  corresponds 
to  the  method  of  obtaining  the  smooth  functions  ^(*).  Various  smoothing  algorithms  are 
present,  although  experience  has  shown  this  to  have  minor  impact.  Robust  smoothers 
can  be  utilized,  as  well  as  monotone  smoothers,  or  requiring  the  smooths  to  lie  in  some 
parametric  family.  Except  for  a  brief  discussion  in  the  following  chapter  these  issues  will 
not  be  addressed  in  any  detail,  and  the  reader  is  invited  to  implement  any  or  all  of  the 
above,  according  to  his/her  indiscretion.  The  actual  smoother  implement  in  the  examples 
included  here  is  a  variable  span  smoother  having  three  separate  band-widths,  each  being 
fitted  to  the  data  by  a  local  least-squares  algorithm  [18]. 

To  summarize,  the  algorithm  used  to  fit  (2.2.1a)  and  (2.2.1b)  is  as  follows: 


Set  «  =  1 

While  sum  of  residuals  squared  shows  large  enough  decrease 
=  =  =  id 

“1,1  ”p,» 

Set  k  °  1 


Iterate  until  (2.2.2)  fails  to  decrease 

^(*+i)(y)  _  gjiootjj  Qf  Y  on 

updated  according  to  (2.2.6) 
End  innermost  loop 
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Set  0i  =  0f.  Sy,.  =  s{5 
Set  r  =  r-Esy,.(Xy) 

»  t-f  *  +  1 

End  outer  loop 


Chapter  3 


Some  comparisons 


Whenever  a  new  method  is  suggested  which  either  extends  or  generalizes  existing 
procedures  an  important  question  is  its  performance  in  cases  when  the  previous  procedures 
yield  satisfactory  results.  If  the  new  method  fails  in  these  cases  or  yields  results  differing 
significantly  from  the  previously  found  results,  some  question  of  value  is  raised.  It  is  clearly 
not  possible  to  examine  all  cases  in  which  the  classical  procedures  have  analyzed  tables 
successfully,  and  yet  reasonable  performance  on  just  a  few  cases  is  sufficient  to  establish 
some  basis  of  trust  in  it,  especially  if  in  these  cases  the  results  of  the  old  and  the  new  are 
comparable.  In  what  follows,  several  “real”  data  sets  are  extracted  from  various  sources  for 
which  the  classic  model  seems  appropriate  and  the  results  are  compared  to  the  results  of 
the  P.ACE  algorithm. 


As  described  in  the  previous  section,  different  results  for  P.ACE  may  be  obtained  by 
use  of  different  smoothers.  In  particular,  a  type  of  parametric  P.ACE  may  be  obtained  if  the 
smooths  are  constrained  to  lie  in  some  parametric  family.  As  stated  earlier,  this  parametric 
P.ACE  procedure  is  not  advocated.  However,  the  relationship  between  P.ACE  and  some 
of  the  classic  procedures  is  most  easily  seen  by  use  of  this  parametric  P.ACE  procedure  as 
an  intermediary.  In  four  examples  a  classical  model  is  shown  to  be  related  to  the  P.ACE 
procedure  in  which  the  curve  is  forced  to  lie  in  some  parametric  family.  If  the  range  of  the 
non-parametric  smoother  of  the  P.ACE  procedure  encompasses  this  parametric  family 
then  it  can  be  deduced  that  the  P.ACE  procedure  generalizes  some  variant  of  the  classical 
procedure. 
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3.1.  Exponential  models. 

The  first  family  is  a  two  parameter  family  of  exponentials,  Fe  =  {ae**|o,  6  €  IR}.  The 
parameter  6  is  included  because  of  the  scale  constraint  on  the  scores.  To  simplify  notation 
assume  the  table  imder  consideration  is  an  r  x  s  x  t  table.  Letting 


'  log  Pi, •  =  6Si(t)  ' 

<  logpay  =  ftSaO)  > 
.  log  Pzk  =  bS3{k)  , 

the  model  corresponding  to  this  family  is 


(3.1.1) 


Yijjt  =  oexp{6  (Si(»)  +  82(7)  +  S3(i))}  =  apiiP2jPzk  (3.1.2) 

where  pu,  p2y,  and  ps*  satisfy  X),-  log  Pi,-  =  X)y  log  Pij  =  E*  log  Pst  =  0  Vi,  j,  and  k  and 
are  chosen  to  minimize 


5^(Tiyt  -  opi.Pzy  Pst)*  (3.1.3) 

This  corresponds  to  the  model  of  independence  and  is  closely  related  to  Birch’s  method 
of  analyzing  contingency  tables  having  no  higher  order  interactions  [6].  This  method  of 
estimation  consists  of  taking  logarithms  of  the  response  variable  and  applying  standard 
AN OVA  constraints  to  the  transformed  data.  In  the  case  of  independence  this  reduces  to 
the  model 


log  Yijk  =  lu  +  l2j  +  7s*  +  c 


(3.1.4) 


where  X2,-  7i,‘  =  Sy  72;  =  S*  73*  =  0  Vi,  j,  k.  When  the  estimates  are  given  by 
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Ijk  —  log  1 


•>=* 


»>=* 


they  can  be  shown  to  minimize 


(3.1.5) 


^(log  Yijit  -  'in  -  72>  -  Izk  -  c)*-  (3.1.6) 

Another  method  to  fit  this  model  is  to  use  the  maximum  likelihood  estimates.  In  the  case 
of  independence  the  estimates  are  given  by 


Ijk  —  ^  l)  (3.1.7) 

It  is  seen  that  using  the  first  method  of  estimation,  models  (3.1.2)  and  (3.1.4)  differ 
only  in  that  model  (3.1.4)  minimizes  on  a  log  transformed  scale  while  the  proposed  model 
(3.1.2)  minimizes  on  the  original  scale.  Consequently,  in  the  case  where  model  (3.1.4)  is 
appropriate  one  might  expect  that  they  yield  similar  results.  The  estimates  given  by  (3.1.5) 
can  be  considered  as  the  average  of  the  log  of  the  responses  while  the  estimates  given  by 
(3.1.7)  are  the  log  of  the  average  of  the  responses.  Although  Jensen’s  inequality  forces  the 
estimates  to  differ,  if  the  log  is  semi-linear  in  the  region  of  interest  the  estimates  should  be 
similar,  and  consequently  the  the  results  of  models  (3.1.2)  and  (3.1.6)  should  be  similar. 

As  an  example  a  data  set  was  analyzed  using  the  P.ACE  algorithm  ,  the  Birch  method, 
and  the  maximum  likelihood  method.  The  data  consisted  of  60  observations  taken  from  the 
U.N.  Demographic  Yearbook  1980  [30].  The  response  variable  was  German  infant  mortality 
rate  and  the  predictors  consisted  of  four  categorical  variables  described  on  page  33.  Also 
listed  on  that  page  are  the  scores  derived  from  the  Birch  analysis  {labeled  Anova),  from 
the  maximum  likelihood  method,  (iabeied  ML.),  and  from  the  P.ACE  procedure  {labeled 
PACE). 

The  following  two  pages  contain  plots  of  the  corresponding  curves.  On  page  34  the 
sum  of  scores  corresponding  to  the  Anova  analysis  is  plotted  in  plusses  against  the  response 


Section  3,2:  Linear  models  24 


variable  and  the  corresponding  exponential  curve  is  drawn  through  using  dotted  lines.  Also 
on  the  graph  are  the  sum  of  scores  corresponding  to  the  PACE  procedure  plotted  in  circles 
with  the  estimated  curve  plotted  in  solid  lines.  On  page  35  the  curve  corresponding  to  the 
P.ACE  analysis  is  reproduced.  The  sum  of  scores  of  the  maximum  likelihood  procedure  is 
plotted  as  plusses  and  the  dotted  line  represents  the  corresponding  exponential  curve.  As 
can  be  seen  from  the  following  pages  the  scores  are  all  very  similar  to  each  other,  and  the 
curves  corresponding  to  them  are  nearly  identical. 

Since  all  three  curves  are  monotone  increasing,  large  scores  correspond  to  increased 
mort2dity  rates  while  small  scores  correspond  to  decreased  mortality  rates.  In  all  three 
procedures  the  ordering  of  the  scores  within  each  category  is  the  same,  indicating  agreement 
of  the  procedures  in  the  relative  ranking  of  attributes  related  to  increase  mortality.  The 
largest  and  smallest  scores  occur  in  the  category  of  age  at  death,  indicating  that  this  is  the 
most  important  factor  in  determining  early  and  late  mortality.  The  period  of  elapsed  time 
involved  in  the  different  age  groups  vary,  making  direct  comparisons  difficult.  It  should  be 
noted  that  the  two  periods  of  smallest  time  elapsed  {less  than  one  day  and  between  one 
and  six  days)  have  the  largest  scores,  indicating  that  these  periods  of  time  are  at  greater 
risk  than  any  other  period  involved,  while  the  category  consisting  with  the  largest  amount 
of  elapsed  time  (five  to  eleven  months)  has  the  smallest  scores,  indicating  the  smallest  risk 
group.  Both  country  of  birth  and  sex  of  child  follow  as  important  factors,  with  scores  of 
approximately  equal  magnitudes.  The  scores  corresponding  to  the  year  of  birth  are  very 
small,  indicating  that  year  of  birth  is  the  least  informative  category. 


3.2.  Linear  models. 

The  second  family  consisted  of  the  two  parameter  family  of  linear  functions,  Fi  — 
{ax  +  6  I  a, 6  €  92}.  The  parameters  a  and  b  are  required  because  of  the  scaling  of  the 
scores.  The  model  corresponding  to  this  family  is 
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=  Es.(v)  +  ‘  («1) 

Such  a  model  can  be  thought  of  arising  from  taking  logarithms  of  (3.1.2)  with  the  re¬ 
sponse  in  (3.2.1)  corresponding  to  the  logarithm  of  the  response  in  (3.1,2).  The  linear  model, 
however,  arises  naturally  in  several  untransformed  models.  This  particular  restriction  of  the 
smoother  corresponds  exactly  to  a  procedure  described  by  Yoimg  [31].  The  method  sug¬ 
gested  for  fitting  such  models  was  to  alternate  between  optimizing  the  scores  and  the  linear 
fit,  a  procedure  corresponding  exactly  in  spirit,  if  not  in  detail,  to  the  P.ACE  procedure. 
Consequently,  P.ACE  can  be  thought  of  as  a  direct  generalization  of  this  method. 

As  an  example  a  data  set  was  analyzed  using  the  P.ACE  algorithm  in  which  the 
smooth  functions  were  constrained  to  be  linear  and  the  unmodified  P.ACE  procedure. 
The  data  consisted  of  140  observations  taken  from  the  U.  N.  Demographic  Yearbook  [30]. 
The  response  variable  was  expected  number  of  years  prior  to  death  and  the  predictors 
consisted  of  the  nationality,  age,  and  sex  of  the  subject  as  well  as  the  year  of  the  estimate. 
These  can  be  found  on  page  36,  as  well  as  the  scores  derived  from  both  models. 

On  the  following  page  the  sum  of  the  scores  corresponding  to  the  linear  model  is  plotted 
in  plusses  against  the  response  variable  and  the  corresponding  straight  line  is  drawn  using 
dotted  lines.  Also  on  the  same  graph  are  the  scores  corresponding  to  the  unmodified  P.ACE 
procedure  drawn  in  circles  and  the  corresponding  curve  drawn  using  a  solid  line.  As  can  be 
seen  from  the  graph  and  the  table,  the  two  sets  of  scores  are  very  similar  and  not  surprisingly 
the  curves  are  also  quite  similar. 

Large  scores  correspond  to  a  decrease  in  life  expectancy,  while  small  scores  correspond 
to  an  increase.  With  one  exception  {Cuba  versus  Italy  )  the  ordering  within  each  category 
is  the  same,  indicating  that  both  procedures  agree  in  the  relative  effect  and  importance  of 
the  different  categories.  As  was  to  be  expected,  age  was  the  most  important  factor.  The 
next  overall  important  variable  was  sex,  indicating  that  females  tended  to  have  a  definite 
advantage  over  males  in  terms  of  life  expectancy.  Much  less  important  was  the  year  of  the 
estimate,  and  almost  irrelevant  was  the  country.  Since  it  was  the  coimtry  variable  which 
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displayed  the  difference  in  ordering,  not  too  much  importance  should  be  placed  on  this 
aberration.  Overall,  the  fit  seems  quite  close. 


3.3.  Box-Cox  models. 

A  family  of  functions  of  transformations  was  introduced  by  Box  and  Cox  in  1974  [9]. 
The  family,  now  known  as  Box-Cox  transformations,  is  a  one  parameter  family  of  curves 
defined  by  Fob  =  {/a(*)  =  |  A  €  SR},  with  the  condition  that  /o(:c)  =  — 

ln(2).  Given  a  sequence  of  predictor  and  response  variables  suggested 

that  a  linear  fit  be  found  to  the  transformed  sequence  {(x,*, /\(yt))}^ij  where  A  is  chosen 
to  maximize  the  fit.  Thus  the  model  fitted  is 


fx{Y)  =  ao  +  ^  ay  fx  €  Fcb  (3.3.1) 

This  suggests  that  an  appropriate  class  of  parametric  curves  to  consider  in  the  P.  ACE 
methodology  may  be  the  three  parameter  family  of  scale  and  location  shifts  of  the  inverses 
of  the  Cox-Box  transformations,  namely  ^CB  =  +  I  )  €  Fcb}- 

It  should  be  noted  that  this  family  contains  a  large  class  of  functions,  including  both  the 
linear  and  exponential  curves  considered  earlier.  The  model  corresponding  to  this  family  of 
curves  can  be  written  as 


p 

. .-p  =  fx*  (“  +  ft  E  /a*  e  Fob  (3.3.2) 

Clearly  the  two  models  are  closely  related.  Superficially  it  appears  possible  to  trans¬ 
form  one  into  the  other  by  merely  inverting  the  function  fx{‘).  By  the  argument  in  chapter 
two,  however,  it  is  clear  that  one  would  not  expect  the  function  fx{‘)  of  (3.3.1)  to  be  the 
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inverse  of  the  function,  fx*{')y  of  equation  (3.3.2).  Except  in  cases  where  the  function  is 
linear  the  two  functions  correspond  to  different  values  of  A,  although  in  many  cases  it  would 
seem  reasonable  that  the  two  values  of  A  be  close. 

When  there  are  only  a  few  distinct  values  of  each  predictor  it  is  possible  to  consider 
the  predictors  to  be  categorical.  Writing 

«0  +  2  ajxj  =  (oo  +  ajXj)  +  6'  }  (3-3.3) 

where  6^  is  included  to  satisfy  whatever  prerequisite  scaling  is  being  invoked  for  scores,  it 
can  be  seen  that  the  centered  and  scaled  predictors  —  Xj)fb^  correspond  to  scores  in 
the  categorical  model  (3.3.2).  There  is  a  difference,  however,  between  the  scores  of  (3.3.2) 
and  the  “scores*  of  (3.3.3).  Whereas  the  scores  of  the  P.ACE  procedme  are^  chosen  as  to 
maximize  the  fit  to  the  model  and  are  subjected  only  to  scale  and  location  constraints,  the 
scores  of  (3.3.3)  are  constrained  to  be  scale  and  location  shifts  of  the  values  of  the  original 
predictors,  reducing  the  amount  of  fiexibility  in  the  model.  Letting  a'  =  ao  +  ^  ajxj  and 
S/{xi)  —  —  ®y)/6^  the  Box  Cox  model  can  be  written  as 


/x(y)  =  a'  +  6'X;S/(Xy)  (3.3.4) 

making  the  relationship  to  the  parametric  version  of  P.ACE  transparent. 

With  this  identification  it  is  possible  to  compare  the  two  procedures  on  a  data  set. 
The  data  examined  comes  from  the  Box  Cox  paper  [9].  The  data  resulted  from  attaching 
to  yam  of  three  different  lengths  three  different  sets  of  weights  and  subjecting  the  yam 
to  swings  of  three  different  amplitudes.  The  response  was  the  number  of  swings  prior  to 
the  breaking  of  the  yam.  Since  the  number  of  distinct  values  of  the  predictor  variables 
is  small,  it  is  possible  to  consider  them  as  categorical  and  use  the  P.ACE  procedure  on 
them.  FVom  the  discussion  above,  however,  it  can  be  seen  that  the  resulting  scores  from  the 
two  procedures  can  not  be  compared,  making  it  difficult  to  evaluate  the  two  procedures. 
To  eliminate  this  problem  the  Cox- Box  model  was  replaced  by  the  model  (3.3.2).  This 


Section  S.J^:  Logistic  models  28 


replacement  allowed  the  same  degree  of  freedom  in  determining  the  scores  and  had  both 
models  minimize  the  squared  error  between  the  fit  and  the  data  in  the  same  metric.  Given 
that  the  Box-Cox  model  and  the  model  (3.3.2)  are  similar,  this  approach  does  not  appear 
to  be  entirely  unreasonable. 

The  scores  of  the  two  procedures,  as  well  as  the  predictors  and  the  true  and  estimated 
values  of  the  response  of  both  models  can  be  foimd  on  page  38.  On  page  39  the  sum  of 
scores  corresponding  to  the  P,ACE  procedure  is  plotted  in  circles  against  the  response 
variable  and  the  fitted  line  is  drawn  using  a  solid  line.  The  sum  of  scores  corresponding 
to  the  pseudo- Box-Cox  model  is  plotted  in  plusses  against  the  response  on  the  same  graph 
and  the  corresponding  curve  is  plotted  in  a  dotted  line. 

Note  that  the  curve  and  the  estimates  agree  rather  closely.  The  curve  chosen  to 
maximize  the  fit  was  f(x)  —  The  inverse  of  this  is  the  function  f{y)  — 

which  is  in  close  agreement  with  the  transformation  Box  and  Cox  found  in  their  paper, 
namely  f(y)  —  Two  of  the  three  predictors,  length  and  load,  show  very  good 

agreement  in  the  scores,  and  in  all  cases  the  ordering  is  the  same. 


3.4.  Logistic  models. 

The  four  parameter  family  of  curves  Fio  =  {a  +  +  e**'^'^)\a,b,c,d  €  Si} 

encompass  a  large  number  of  curves  not  contained  in  the  families  previously  considered. 
When  the  response  variable  represents  the  probability,  P,  of  the  occurence  of  some  event 
given  a  specific  set  of  covariates  zi, . . . ,  Zk,  a  model  often  used  is  the  logistic  model, 


l  +  e*+E«.». 


(3.4.1) 


corresponding  to  the  sub-family  of  Fio  in  which  the  additive  term  is  zero  and  the  mid- 
tiplicative  term  is  one.  The  reason  for  the  popularity  of  the  model  (3.4.1)  is  due  to  an 
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equivalent  formulation, 


In 


P(xu. 


Xk) 

•,Xk) 


+  6 


(3.4.2) 


which  allows  a  linear  analysis  of  the  transformed  response.  As  noted  previously,  such  an 
analysis  will  not  be  optimal  when  predicting  responses  from  the  covariates  and  in  such  cases 
the  model  fitted  should  be  (3.4.1). 

Often  the  covariates  are  measurements  of  some  continuous  variable.  However,  as  in 
the  case  of  the  Box-Cox  models,  if  the  exists  only  a  few  distinct  values  of  the  covariates 
they  can  be  considered  categorical.  In  such  cases  the  linear  term  in  (3.4.1)  and  (3.4.2)  is 
replaced  by  a  location  and  scale  shift  of  sums  of  scores. 

To  demonstrate  an  instance  in  which  both  the  logistic  model  and  the  P.ACE  pro¬ 
cedure  seem  to  be  in  some  agreement  the  win-loss  record  of  the  American  baseball  league 
in  1948  was  examined  [1].  The  predictor  variables  consist  of  each  of  the  eight  teams  in 
the  American  league  and  the  response  for  the  t—  j—  entry  of  the  table  is  the  number  of 
games  team  t  won  against  team  j  that  season.  This  data  is  an  example  of  a  table  having 
structural  zeros  along  the  main  diagonal,  since  presumably  a  team  does  not  play  against 
itself.  Initially  there  appears  56  observations,  a  pair  of  wins  and  losses  corresponding  to 
each  pair  of  teams.  However,  with  four  exceptions  each  pair  of  teams  played  22  games, 
effectively  reducing  the  number  of  observations  to  28.  Given  such  a  symmetry  it  would  be 
reasonable  to  expect  any  model  based  on  the  data  to  reflect  this  symmetry  as  well. 

A  straight  forward  application  of  the  P.ACE  methodology  would  construct  a  model 
of  the  form 


^  of  wins  of  team  A  against  team  B  =  ^(St0tn(A)  —  S/^f,(B)).  (3.4.3) 

La  this  model  each  team  would  have  two  scores,  one  score  for  the  games  it  won  and  one 
score  for  the  games  it  lost.  A  reasonable  alternative  would  be  to  fit  the  model 
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^  of  wins  of  team  A  against  team  B  =  ^(S(A)  —  S(B))  (3.4.4) 


which  assign  a  single  score  to  each  team.  This  model  would  not  only  reflect  the  symmetry  of 
the  problem  but  also  reduces  the  number  of  estimated  parameters  by  8.  It  is  this  later  model 
that  is  fitted.  This  is  accomplished  by  fitting  model  (3.4.3)  with  the  additional  constraint 
that  Sy0iD(A)  =  S/o«#(A). 

The  logistic  model  fitted  was  of  the  form 


#  of  wins  of  team  A  against  team  B  =  0  x  i(S(A)  —  S(B))  (3.4.5) 


where  L  is  some  scale  and  location  shift  of  the  standard  logistic  curve  and  the  numbers 
a  and  0  are  chosen  to  maximize  the  fit.  In  both  models  the  scores  can  be  thought  of 
representing  some  measure  of  the  strength  of  the  teams  while  the  functions  represent  the 
means  to  translate  the  relative  strength  two  teams  into  the  expected  number  of  wins  for 
either  team. 

The  observed  values  and  the  estimates  from  both  models  may  be  found  on  page  40, 
as  well  as  the  scores  obtained  by  both  methods.  Page  41  contains  the  graph  of  the  sum  of 
scores  of  the  P.ACE  procedure  versus  the  observed  values  plotted  in  circles  with  the  curve 
estimated  by  the  procedure  drawn  in  a  solid  line.  On  the  same  graph  the  sum  of  scores 
of  the  logistic  fit  versus  the  observed  values  are  plotted  in  plusses  with  the  corresponding 
logistic  function  plotted  in  a  dotted  line.  The  two  curves  are  remarkably  similar.  The 
scores  are  comparable  as  are  both  fits  to  the  data.  The  ordering  of  both  scoring  systems 
is  the  same  and  coincides  with  the  final  standings  of  the  league  at  the  end  of  the  season, 
reinforcing  the  interpretation  that  the  scores  represent  some  measure  of  relative  strength  of 
the  teams. 
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3.5*  Comments. 

Four  different  parametric  models  were  considered  in  this  chapter;  the  exponential 
model,  the  linear  model,  models  arising  from  Box-Cox  transformations  and  the  logistic 
model.  In  each  case  the  P.ACE  model  was  shown  to  generalize  to  some  variant  of  the 
parametric  model.  For  each  of  the  parametric  models  a  data  set  was  selected  for  which 
that  particular  model  fitted  well.  However,  no  single  parametric  model  was  appropriate  for 
all  four  data  sets.  On  the  other  hand,  the  P.ACE  model  fitted  all  four  sets  well,  yielding 
results  similar  to  those  obtained  by  the  corresponding  parametric  models.  Thus  the  P.ACE 
procedure  has  the  advantage  of  freeing  the  investigator  from  having  to  postulate  possibly 
wrong  parametric  model  assumptions  without  the  concern  of  possibly  obtaining  very  dif¬ 
ferent  results  if  a  particular  parametric  model  is  appropriate.  Because  of  this  fiexibility  the 
P.ACE  model  may  be  used  as  a  model  selection  procedure  directing  the  researcher  towards 
a  specific  parametric  model. 

On  the  other  hand,  it  is  reasonable  to  believe  that  data  sets  exist  for  which  none  of 
the  standard  parametric  models  are  appropriate  while  the  general  P.ACE  model  fits  well, 
although  the  author  has  yet  encountered  such  a  set.  Thus  the  P.ACE  procedure  may  be 
viewed  as  a  method  of  fitting  a  non-parametric  model  of  a  contingency  table  without  further 
reference  to  any  parametric  model.  In  addition,  an  advantage  of  the  procedure  is  that  it 
appears  to  be  relatively  insensitive  to  missing  values  in  the  table.  This  aspect  is  discussed 
in  further  detail  in  the  next  chapter. 


3.6.  Graphs  and  Tables. 

This  section  contains  graphs  and  tables  of  the  data  considered  in  this  section.  As  a 
matter  of  consistency  the  following  conventions  have  been  been  followed.  Whenever  the 
P.ACE  procedure  is  compared  with  an  alternate  procedure 

1)  The  sum  of  scores  obtained  by  the  P.ACE  procedure  versus  the  observations  are 
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plotted  as  circles. 

2)  The  curve  obtained  by  the  P.ACE  procedure  is  plotted  as  a  solid  line 

3)  The  sum  of  scores  obtained  by  the  alternate  procedure  versus  the  observations  are 
plotted  as  plusses. 

4)  The  parametric  curve  obtained  by  the  alternate  procedure  is  plotted  as  a  dotted  line 
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German  Infant  Mortality  Data 


Predictor  variables 


1) 

country  of  birth 

0 

East  Germany 

1 

West  Germany 

2) 

year  of  death 

0 

1971 

1 

1972 

2 

1973 

3) 

sex  of  infant 

0 

male 

1 

female 

4) 

age  at  death 

0 

less  than  one  day 

1 

less  than  six  days 

2 

less  than  twenty  seven  days 

3 

less  than  five  months 

4 

less  than  eleven  months 

Scores 

predictor 

M.L. 

Anova 

RACE 

1 

-0.150 

-0.129 

-0.161 

0.150 

0.129 

0.161 

2 

0.025 

0.027 

0.036 

-0.004 

0.018 

0.006 

-0.021 

-0.046 

-0.042 

3 

-0.139 

-0.130 

-0.152 

0.139 

0.130 

0.152 

4 

0.567 

0.560 

0.561 

0.648 

0.595 

0.651 

-0.575 

-0.592 

-0.559 

0.050 

0.073 

-0.069 

-  0.690 

-0.636 

-0.584 
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German  Infant  Mortality  Data 
P.ACE  model  and  Anova  model 


to 


lO 


1 
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Life  Expectancy  Data 


predictor 

Linear 

RACE 

scores 

scores 

1)  cotmtrv 

Chile 

-  0.081 

-0.075 

Cuba 

0.069 

0.049 

Singapore 

-0.028 

-0.019 

Kuwait 

0.032 

0.028 

Mexico 

-  0.054 

-0.051 

Italy 

0.053 

0.053 

Scotland 

0.009 

0.015 

2)  ^ 

at  birth 

0.661 

0.627 

at  ten  years 

0.490 

0.489 

at  twenty-five  years 

0.137 

0.162 

at  fifty  years 

-0.439 

-0.420 

at  seventy-five  years 

-0.849 

-0.862 

3) 

male 

-0.542 

-0.502 

female 

0.542 

0.502 

4)  vear  of  estimate 

1970 

-  0.162 

-0.129 

1975 

0.162 

0.129 
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Yarn  Data 


Predictor  P.ACE 


scores 

Length 


250  mm 

-0.434 

300  mm 

-0.011 

350  mm 

0.444 

Amplitude 

8  mm 

0.405 

9  mm 

-0.144 

10  mm 

-0.261 

Load 

40  gm 

0.151 

50  gm 

0.045 

60  gm 

-0.195 

Box  -  Cox  model: 

=  (0.892 


Box-Cox 

scores 

-0.454 

0.039 

0.415 


0.330 

-0.001 

-0.329 


0.193 

0.024 

-0.217 


0.114  E/Sy  (»•/))■'"•’ 


♦ 
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Baseball  Data 


L  O  S  S  B  S 


WINS 

Cleveland 

Boston 

New  York 

PhiL 

St.  Louis 

Wash. 

Chicago 

a. 

11 

10.2  11.2 

■HI 

16 

13.1  13.4 

13 

13.2  14.1 

IBSHfR 

I198B9 

16 

15.4  15.2 

B. 

10 

11.4  10.8 

14 

12.5  11.3 

12 

13.5  13.2 

15 

15.3  15.1 

14 

15.4  15.2 

N.Y. 

8 

9.2  10.7 

12 

12.2  13.0 

13 

12.4  13.4 

17 

14.6  15.1 

16 

15.3  15.2 

P. 

6 

8.7  8.6 

10 

8.3  8.8 

10 

9.5  9.0 

WStB 

18 

13.7  14.6 

14 

13.9  14.8 

D. 

HD|[^ 

BSIE^S 

12 

10.7  9.8 

11 

13.4  14.0 

16 

13.6  14.4 

14 

14.9  14.7 

S.L. 

8 

6.9  6.9 

^SjHj 

BSli 

W. 

HHBI 

BSHIH 

^BRI 

11 

10.7  10.1 

Ch. 

6 

6.6  6.8 

8 

6.6  6.8 

6 

6.7  6.8 

6 

6.9  7.0 

8 

7.0  7.3 

8 

8.7  9.3 

9 

8.8  10.1 

observed 

value 

Pace  Logistic 
est.  est. 


Baseball 

RACE 

Logistic 

games 

Team 

scores 

scores 

won 

Cleveland 

0.434 

0.441 

96 

Boston 

0.362 

0.418 

95 

New  York 

0.240 

0.383 

95 

Philadelphia 

0.080 

0.160 

84 

Detroit 

0.051 

0.028 

78 

St.  Louis 

-0.287 

-0.370 

59 

Washington 

-0.313 

-0.463 

55 

Chicago 

-0.566 

-0.560 

51 
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Chapter  4 


Simulations 


The  algorithm  has  been  demonstrated  to  perform  reasonably  well  on  a  few  “real”  data  ex¬ 
amples.  It’s  performance  on  sparse  tables  has  yet  to  be  examined.  The  intent  of  this  chapter 
is  to  demonstrate  the  performance  of  the  procedure  under  several  different  conditions  using 
simulated  data. 


4.1.  The  Data. 


All  the  simulated  data  arose  from  a4x4x3x2x3  contingency  table.  This  table 
of  288  cells  was  considered  an  acceptable  compromise  between  tables  of  larger  sizer  which 
were  computationally  very  expensive  to  analyze  and  tables  of  smaller  size  which  could  be 
less  informative.  For  all  simulations  the  scores  were  fixed  and  were  as  follows: 
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predictor  scores 

1)  0.10 

-0.0333333 

-0.10 

0.0333333 

2)  -0.22 

-0.14 

0.02 

0.34 

3)  0.20 

0.20 

-0.40 

4)  -0.1597079 

0.1597079 

5)  -  .1202921 

-.08 

0.2002921  (4.1.1) 


The  scores  for  predictor  one  were  chosen  so  that  the  inner  two  scores  were  separated 
by  a  relatively  smedl  distance  while  the  outer  two  scores  were  at  a  much  greater  distance 
away.  The  intent  was  to  see  if  given  the  relatively  large  separation  of  the  outer  scores  said 
the  relatively  small  separation  of  the  inner  scores  if  the  P.ACE  procediire  could  distinguish 
and  separate  the  inner  scores.  The  scores  for  predictor  2  are  a  scale  and  location  shift  of 
the  sequence  8,  16,  32,  64.  The  geometrically  increasing  separation  of  consecutive  scores 
offered  another  method  of  assessing  the  procedure’s  ability  to  distinguish  within  categories 
scores.  Two  of  the  scores  of  predictor  3  were  chosen  to  be  equal  to  see  if  the  procedure 
would  be  able  to  detect  this  fact.  The  scores  for  predictor  4  were  random.  It  was  the 
author’s  intent  to  have  them  equal  to  t/20,  although  his  skill  in  basic  arithmetic  prevented 
this  from  occurring.  The  scores  for  predictor  5  were  what  is  known  as  a  “kludge” .  The 
algorithm  scales  the  scores  so  that  the  maximum  of  the  absolute  value  of  the  sitm  of  the 
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scores  is  1.  Hence  it  was  desirable  to  have  the  maximum  of  the  sum  of  scores  to  be  1  and 
for  symmetry  reasons  to  also  have  the  minimum  of  the  sum  of  scores  to  be  —1.  The  two 
criteria  were  satisfied  by  setting  the  scores  of  predictor  5  to  the  values  presented. 

From  the  examples  in  chapter  3  three  parametric  curves  seem  to  be  appropriate;  the 
linear  curve,  the  exponential  curve,  and  the  logistic  curve.  The  curves  were  defined  on  the 
range  of  the  scores,  i.e.  [—1, 1]  ,  and  were  standardized  by  scaling  and  shifting  them  to  have 
the  value  of  0  at  —1  and  1  at  1.  The  three  curves  used  to  generate  the  data  were 

1)  linear  f{x)  = 

2)  exponential  f{z)  = 

3)  logistic  f{z)  = 

With  the  data  generated,  normal  errors  were  then  added.  To  test  the  procedure  in  the 
best  circumstances  a  trial  with  no  noise  was  run.  In  addition  noise  with  twelve  different 
values  of  standard  deviations  were  also  considered; 


xH- 1 


p.  — 


„T.5ir 


1  +  e 


7.5  jf 


(4,1.2) 


(T  =  .025,  .05,  .075,  .10,  .125,  .15,  .175,  .20,  .25,  .30,  .40,  .50  (4.1.3) 

Several  small  values  of  a  were  included  to  observe  the  influence  of  increasing  the  amount  of 
noise  in  the  data.  The  large  values  were  included  to  observe  the  breakdown  of  the  model. 
From  the  scaling  of  the  curves  used  to  generate  the  data  a  standard  deviation  of  ^  =  .125, 
for  example  ,  corresponds  to  a  noise  to  signal  ratio  of  12.5  %. 

After  generating  the  data  eleven  different  proportions  of  data  were  removed  from  the 
table  to  observe  the  behavior  of  the  algorithm  in  cases  of  missing  data.  The  proportions 
chosen  are  listed  below: 
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proportion 

number 

number 

removed 

removed 

left 

0 

0 

288 

5 

14 

274 

10 

29 

259 

15 

43 

245 

20 

58 

230 

25 

72 

216 

30 

86 

202 

40 

115 

173 

50 

144 

144 

60 

173 

115 

70 

202 

86 

Two  methods  were  chose  to  eliminate  the  data  from  the  table. 

1)  random  deletions  —  the  data  to  be  discarded  as  missing  is  chosen  uniformly  from  the 

table.  For  example,  if  29  observations  are  to  be  deleted,  then  with  equal  probability 
any  of  the  (  )  sets  of  29  numbers  are  chosen. 

2)  select  deletions  —  this  is  a  meager  attempt  to  model  some  kind  of  dependence  be¬ 
tween  the  observations  which  are  missing  and  their  position  in  the  table.  The  cells 
corresponding  to  the  second  category  of  the  first  predictor  or  the  second  category  of 
the  third  predictor  had  a  greater  chance  of  being  eliminated.  More  precisely,  let  Cj{z) 
be  the  category  of  the  j**  predictor  corresponding  the  the  cell  entry  ar.  Then  if  a 
total  of  X  observations  were  to  be  removed  from  the  table,  z/9  would  be  removed 
from  {x|(7o(x)  =  Gz{x)  =  2},  2x/9  would  be  removed  from  {x\Gz{x)  ^  Go{x)  =  2}, 
x/9  would  be  deleted  from  {xlC7o(x)  ^  C'sCx)  =  2},  and  the  remaining  x/3  from 
{x\G,i(x)  #  2,Cz(x)  ^  2}. 

Thus,  to  summarize,  a  contingency  table  with  5  predictors  Xi, . . . , X5  was  considered. 


The  response  could  be  expressed  as 
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Yiu„.,is  “  /( ^y(v))  ^  (4.1.5) 

/=i 

where  the  scores  are  given  in  (4.1.1)  and  the  function  is  one  of  the  functions  in  (4.1.2).  The 
noise,  e,  was  normal  with  zero  mean  and  standard  deviation  one  of  the  values  in  (4.1.3). 
One  of  the  several  different  proportions  listed  in  (4.1.4)  were  removed  by  either  random 
deletions  or  selected  deletions. 

For  each  combination  100  trials  were  run.  In  each  case  the  ciurve  generated  by  the  pro¬ 
cedure  was  evaluated  at  intervals  of  one  tenth  from  —1  to  1  to  obtain  some  measure  of  how 
close  the  curve  found  approximated  the  true  curve.  A  problem  arises  when  the  maximum 
of  the  sum  of  true  scores  corresponding  to  the  the  observations  in  the  deleted  table  was  less 
than  one.  If  the  procedure  found  the  true  curve,  it  would  actually  determine  only  a  portion 
of  the  curve  interior  to  the  range  [—1,1].  Since  the  procedure  used  the  scale  convention 
that  the  maximum  sum  was  one,  the  portion  would  be  magnified  by  some  constant  and 
would  have  the  affect  of  altering  the  apparent  fit  between  the  true  and  estimated  curve. 
To  lessen  this,  the  procedure  kept  track  of  the  “correct”  scaling  of  the  scores,  extending 
by  linear  approximation  those  values  which  were  outside  the  range  of  the  approximating 
curve.  Although  this  practice  is  not  optimal,  the  two  scales  differed  only  rarely  in  tables 
which  had  many  missing  observations  and  the  differences  were  usually  quite  small. 


4.2.  Results. 

For  each  of  the  combinations  listed  above  two  meastires  of  goodness  of  fit  are  calculated. 
The  first  measure  is  the  standard  deviation  of  the  fit  to  the  data.  This  is  the  square  root  of 
the  average  square  deviations  of  the  data  to  the  value  estimated  by  the  P.ACE  algorithm. 
The  second  is  the  standard  deviation  of  the  fit  to  the  model.  This  is  the  the  square  root  of 
the  average  of  the  squared  deviations  of  the  true  underlying  curve  to  the  curve  estimated 
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by  the  P.ACE  algorithm,  evaluated  at  the  points  {—1.0, —0.9, ...,0.9, 1.0}.  The  results 
can  be  found  in  table  la  through  table  VIb  at  the  end  of  this  chapter. 

The  first  observation  concerns  the  deviation  of  the  fit  to  the  data.  In  all  cases  consid¬ 
ered  the  error  is  relatively  constant  across  the  differing  amounts  of  missing  data  and  the 
type  of  deletion  used.  For  the  exponential  and  linear  models,  except  for  very  low  levels  of 
noise  to  signal  ratios  the  deviation  of  fit  to  the  data  is  smaller  than  the  standard  devia¬ 
tion  of  the  noise  added  to  the  data.  This  suggests  that  the  procedure  is  over-fitting  the 
data,  although  the  difference  between  the  two  numbers  is  quite  small  in  most  cases  and  is 
unlikely  to  be  very  serious.  The  fit  for  the  logistic  model  is  somewhat  worse,  only  a  few 
cases  having  a  smaller  standard  deviation  to  fit  than  the  standard  deviation  of  the  noise 
added.  This  is  most  likely  due  to  the  fact  that  unlike  the  exponential  and  linear  curves,  the 
logistic  curve  is  neither  convex  nor  concave.  The  procedure  appears  to  perform  quite  well 
in  general  when  trying  to  fit  straight  lines  and  only  slightly  less  well  in  fitting  functions 
which  deviate  slightly  from  linear.  The  convexity  of  the  exponential  functions  considered 
is  relatively  small,  undoubtably  accoxmting  for  some  of  the  success  of  the  procedure  in  that 
case.  The  logistic  function,  however,  is  highly  non-linear  and  the  procedure  has  difficulty 
in  finding  the  exact  curve. 

The  second  observation  concerns  the  standard  deviation  of  the  fit  to  the  model.  As 
would  be  expected,  the  fit  degenerates  as  either  the  number  of  observations  missing  or  the 
noise  to  signal  ratio  increases.  What  is  important  to  note  is  that  the  deterioration  of  the 
fit  is  not  very  sensitive  to  the  number  of  missing  values.  This  indicates  that  the  procedure 
may  be  of  use  when  large  amount  of  data  is  missing. 

The  ability  of  the  procedure  to  locate  the  correct  scores  appeared  to  be  very  insensitive 
to  either  the  type  of  deletions,  the  imderlying  model,  the  amount  of  noise  or  the  amount 
of  data,  provided  that  noise  to  signal  ratio  was  30%  or  less.  For  this  reason  only  one  set  of 
scores  have  been  included.  Page  63  contains  the  average  scores  obtained  from  the  P.ACE 
procedure  for  the  logistic  model  with  30%  of  the  observations  selectedly  deleted  and  with  a 
noise  to  signal  ration  of  25%.  Included  also  are  the  true  scores  and  the  standard  deviation 
of  each  score  from  its  mean.  As  can  be  seen,  there  is  close  agreement  between  the  estimated 
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scores  and  the  true  values,  and  the  estimated  scores  tend  to  be  rather  stable.  There  seems  to 
be  a  tendency  to  find  scores  having  greater  uniformity  in  between  score  spacings  than  exists 
in  the  original  scores.  The  two  inner  scores  of  predictor  1  have  been  separated  somewhat 
and  the  procedure  has  brought  the  scores  of  predictor  2  closer  to  a  set  of  uniformly  spaced 
scores.  Nevertheless,  these  affects  are  rather  slight. 

To  obtain  some  idea  of  how  well  the  model  estimated  the  the  underlying  curve  two 
graphs  were  drawn  for  each  of  the  underlying  curves.  To  illustrate  the  ability  of  the  proce¬ 
dure  to  find  structure  in  sparse  tables  all  the  examples  were  chosen  from  tables  missing  50 
%  of  the  observations.  There  seemed  to  be  little  dependence  of  the  curves  on  the  type  of 
deletion  present  in  the  table.  The  tables  with  random  deletions  and  30  %  noise  to  signal  ra¬ 
tio  and  also  the  tables  with  select  deletions  with  15  %  noise  to  signal  ratio  were  chosen  from 
each  of  the  three  different  curves.  The  true  value  of  the  curve  was  drawn  in  circles  at  each 
of  the  points  {“1.0,  —0.9, , . .  ,0.9, 1.0}.  Also  at  these  points  the  median,  the  quartiles,  and 
the  5  and  95  percentiles  were  calculated  from  the  100  trials  of  the  simulation  and  plotted. 
The  medians  were  connected  with  a  straight  line,  the  quartiles  with  dotted  lines  and  the  5 
and  95  percentiles  with  dotted  lines,  giving  some  idea  of  how  well  the  curves  were  approxi¬ 
mated.  The  curves  are  rather  self-explanatory.  Just  a  few  points  should  be  mentioned.  The 
flaring  out  at  the  ends  is  common  in  smoothers  and  is  caused  by  the  asy metric  position  of 
the  data  in  the  window  of  the  smoother.  It  should  be  noticed  that  interior  to  the  interval 
[— 1, 1]  the  fit  is  rather  good  and  the  5  and  95  percentiles  are  close  to  the  true  values.  Prom 
the  graphs  it  appears  as  if  the  procedure  is  somewhat  median  biased,  although  the  curves 
corresponding  to  15  %  noise  to  signal  ratio  have  very  little  bias  in  the  interior  of  the  range. 


4.3.  Conclusions. 


The  P.ACE  algorithm  presented  offeres  a  method  of  fitting  the  model 
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(4.3.1) 


to  contingency  tables.  The  model  appears  to  be  a  reasonable  model,  corresponding  to 
certain  established  parametric  models  such  as  the  independent  model  or  the  logistic  model. 
As  shown  in  chapter  3,  it  also  yields  similar  results  to  these  classical  models  in  those 
instances  that  the  models  describe  the  data  well. 


There  appears  to  be  several  advantages  of  using  the  P.ACE  algorithm  and  the  model 
(4.3.1)  .  The  first  advantage  is  in  the  generality  of  the  model  and  the  few  assumptions  made 
about  the  data.  Because  the  model  mimics  the  classical  models  well  it  can  also  be  used  as 
a  means  of  model  selection,  and  prove  useful  in  exploratory  analysis  of  contingency  tables. 
In  many  cases  the  scores  obtained  by  the  procedure  can  help  give  some  indication  of  the 
relative  importance  of  each  predictors,  and  can  give  some  means  of  interpreting  the  effects 
certain  predictors  on  the  response. 

The  second  advantage  is  the  ability  to  perform  well  under  conditions  in  which  much 
of  the  data  is  missing.  The  method  can  provide  estimates  for  the  zero  cells  of  a  table  while 
making  a  minimal  set  of  assumptions  and  can  help  detect  patterns  in  tables  which  may  be 
difficult  to  observe  otherwise. 


There  are  also  several  disadvantages  of  the  procedure.  The  procedure  does  not  perform 
well  when  the  contingency  table  is  small  or  when  the  range  of  the  response  is  large  compared 
the  the  number  of  observations.  The  first  problem  is  due  to  the  fact  that  the  procedure 
is  estimating  a  large  number  of  parameters  and  small  tables  do  not  provide  enough  infor¬ 
mation.  The  second  problem  is  related  to  the  smoothing  algorithm  incorporated  into  the 
procedure.  The  amount  of  possible  curvature  in  the  output  of  the  smoother  is  a  function 
of  the  bandwidth,  which  is  bounded  below  by  the  number  of  data  points.  Thus  small  data 
sets  can  not  produce  a  smooth  curve  with  a  large  curvature.  The  yam  data  of  chapter  3 
illustrates  this  point. 

There  is  a  definite  lack  of  theory.  It  is  straightforward  to  show  that  under  general 
conditions  an  optimal  set  of  scores  and  functions  exist  which  minimize  the  mean  square  error 
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of  the  model  (4.3.1).  It  is  equally  straight  forward  to  construct  examples  of  non-uniqueness 
of  the  solutions.  There  is  also  no  proof  of  convergence  of  the  algorithm,  although  in  practice 
this  had  never  presented  any  problem  with  the  data  used. 

In  spite  of  the  shortcomings,  however,  the  method  presented  seems  to  be  a  useful 
method  in  the  exploratory  analysis  of  contingency  tables,  and  one  worth  pursuing. 


4.4.  Graphs  and  Charts. 

This  section  contains  the  charts  and  graphs  discussed  in  the  esurlier  sections  of  this 


chapter. 
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Table  la 

standard  deviation  of  fit  to  data 
linear  model  with  random  deletions 


proportion  of  data  missing 
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.15 
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.000 

.00 

.01 

.01 

.01 

.02 

.02 

.02 

.03 

.03 

.04 

.04 

.025 

.02 

.03 

.03 

.03 

.03 

.03 

.03 

.04 

.04 

.04 

.05 
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Table  Ib 

standard  deviation  of  fit  to  model 
linear  model  with  random  deletions 


proportion  of  data  missing 
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Table  Ila 

standard  deviation  of  fit  to  data 
linear  model  with  select  deletions 


proportion  of  data  missing 
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Table  Ilb 

standard  deviation  of  fit  to  model 
linear  model  with  select  deletions 


proportion  of  data  missing 
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Table  Ilia 

standard  deviation  of  fit  to  data 
exponential  model  with  random  deletions 


proportion  of  data  missing 


.00 

.05 

.10 

.15 

.20 

.25 

.30 

.40 

.50 

.60 

.70 

.000 

.01 

.02 

.02 

.02 

.02 

.02 

.02 

.03 

.03 

.04 

.04 

.025 

.03 

.03 

.03 

.03 

.03 

.03 

.03 

.04 

.04 

.04 

.05 

.050 

.05 

.05 

.05 

.05 

.05 

.05 

.05 

.05 

.06 

.06 

.06 

.075 

.07 

.07 

.07 

.07 

.07 

.08 

.08 

.08 

.08 

.08 

.08 

.100 

.10 

.10 

.10 

.10 

.10 

.10 

.10 

.10 

.10 

.10 

.10 

.125 

.12 

.12 

.12 

.12 

.12 

.12 

.12 

.12 

.12 

.12 

.12 

.150 

.15 

.15 

.15 

.15 

.15 

.15 

.15 

.15 

.15 

.14 

.15 

.175 

.17 

.17 

.17 

.17 

.17 

.17 

.17 

.17 

.17 

.17 

.16 

.200 

.20 

.20 

.19 

.19 

.19 

.19 

.19 

.19 

.19 

.19 

.19 

.250 

.24 

.25 

.24 

.24 

.24 

.24 

.24 

.24 

.24 

.23 

.23 

.300 

.29 

.29 

.29 

.29 

.29 

.29 

.29 

.29 

00 

.28 

.27 

.400 

.39 

.39 

.39 

.38 

.39 

.38 

.38 

.38 

.37 

.37 

.36 

.500 

.48 

.48 

.48 

.48 

.48 

.48 

.48 

.47 

.47 

.47 

.44 

>1  p  a.  0 


Section  4-4-  Graphs  and  Charts  56 


Table  lUb 

standard  deviation  of  fit  to  model 
exponential  model  with  random  deletions 


proportion  of  data  missing 
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Table  IVa 

standard  deviation  of  fit  to  data 
exponential  model  with  select  deletions 


proportion  of  data  missing 
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Table  IVb 

standard  deviation  of  fit  to  model 
exponential  model  with  select  deletions 


proportion  of  data  missing 
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Table  Va 

standard  deviation  of  fit  to  data 
logistic  model  with  random  deletions 
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l^ble  Vb 

standard  deviation  of  fit  to  model 
logistic  model  with  random  deletions 


proportion  of  data  missing 
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Table  Via 

standard  deviation  of  fit  to  data 
logistic  model  with  select  deletions 


proportion  of  data  missing 
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Table  VIb 

standard  deviation  of  fit  to  model 
logistic  model  with  select  deletions 


proportion  of  data  missing 
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.250 

.11 

.12 

.12 

.13 

.14 

.15 

.14 

.15 

.19 

.19 

.28 

.300 

.13 

.13 

.13 

.16 

.17 

.18 

.18 

.19 

.21 

.28 

.33 

.400 

.17 

.20 

.21 

.20 

.23 

.24 

.26 

.26 

.35 

.37 

.46 

.500 

.28 

.30 

.29 

.30 

.35 

.37 

.38 

.41 

.41 

.45 

.55 
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Predictor 

1) 

2) 

3) 

4) 

5) 


Scores  found  for  Logistic  Model 

select  deletions 
30%  observations  missing 
Standard  Deviation  =  0.25 


True 

Estimated 

standard 

Scores* ** 

Scores 

deviation’ 

0.100 

0.104 

3.71 

-0.033 

-0.048 

3.94 

-0.100 

-0.100 

3.52 

0.033 

0.044 

4.05 

-0.220 

-0.206 

3.55 

-0.140 

-0.141 

3.35 

0.020 

0.014 

4.32 

0.340 

0.332 

4.32 

0.200 

0.188 

3.00 

0.200 

0.186 

3.49 

-0.400 

-0.373 

4.56 

-0.160 

-0.154 

2.51 

0.160 

0.154 

2.51 

-0.120 

-0.113 

3.05 

-0.080 

-  0.077 

3.38 

0.200 

0.191 

3.01 

*  Three  significant  digits  given  only 

**  Standard  deviations  expressed  in  terms  of  10"^. 
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Appendix 


Smoothing  Algorithms 


There  is  much  literature  on  smoothing  and  the  reader  unfamiliar  with  these  techniques 
is  referred  to  [11],  [18],  [20],  What  follows  is  an  abbreviated  discussion,  concentrating 
primarily  on  the  method  employed  in  the  P.ACE  procedure  demonstrated  in  this  work. 

Given  a  set  of  observations  ^  possible  summary  of  the  data  is 


W  =  f{xi)  +  r,-  (A.l) 

where  /(•),  called  the  smooth,  satisfies  some  smoothness  constraint  and  is  chosen  to  mini¬ 
mize 


*=1  i=l 

The  model  (A.l)  is  appropriate,  in  particular,  if  it  is  assumed  that  the  underlying  random 
variables  which  generated  the  observations  satisfy 

E{Y  \X^z)^g{x)  (A.3) 

in  which  case  the  smooth  /(•)  of  (A.1)  can  be  viewed  as  an  estimate  of  the  conditional 
expectation  in  (A.3).  A  smoother  is  a  procedure  which  has  as  input  a  set  of  bivariate 
observations  and  returns  the  smooth  fimction  satisfying  the  decomposition  (A.l). 

One  method  of  estimating  the  smooth  is  by  use  of  local  linear  least  squares  fits.  Let 
{(X,*,  Yi)}"-!  be  a  fixed  set  of  n  bivariate  observations,  assuming  -Yy  <  Xm  if  j  <  At 
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each  observation  point  Xy  let  Jife(Xy)  be  the  set  of  observations 

Fix  k.  The  local  linear  smoothing  procedure  defines  the  smooth  at  Xy  to  be  the  value  of 
the  least  squares  straight  line  of  the  points  in  Jk{Xj)  evaluated  at  the  point  Xy.  Linear 
interpolation  is  used  to  extend  the  definition  of  the  smooth  to  abscissa  values  other  than 
those  included  in  the  observations. 

The  number  2  Jb  +  1  is  known  as  the  span  and  controls  the  variability  of  the  output  of 
the  smoother.  As  the  span  increases  the  curvature  as  well  as  the  variance  of  the  estimated 
curve  decreases  while  the  bias  increases.  Thus  when  smoothing  a  scatterplot  it  is  desirable 
to  decrease  the  span  in  those  regions  in  which  the  underlying  curve  has  high  curvature  and 
the  observations  have  small  variance.  Likewise,  in  those  regions  in  which  the  imderlying 
curve  has  small  curvature  and  the  observations  have  high  variance  a  large  span  is  desired. 

One  possible  approach  to  a  such  a  variable  span  smoother  is  to  use  a  point-wise  convex 
combination  of  different  fixed  span  smoothers  with  weights  depending  on  some  goodness- 
of-fit  measure.  More  precisely,  let  C7i,...,C7p  be  the  smooth  curves  obtained  by  using  p 
different  spans  of  a  local  linear  smoother.  The  smooth  curve  Cf  generated  by  such  a 
variable  span  smoothing  procedure  can  be  written  as 

=  'E^kiXj)  CtiXj)  (A.4) 

k=l 

where  the  weights  {tt?fc(Xy)}ife  are,  for  each  Xy,  a  partition  of  unity. 

One  simple  method  of  defining  these  weights  would  be  to  let  Wk{Xj)  be  proportional 
to  (ly  “  ^^(Xy))"^.  The  difficulty  with  this  approach  is  that  wjfe(Xy)  need  not  be  close 
to  tafc(Xy4.i).  This  results  in  a  curve  C7/(*)  which  may  have  high  frequency  components. 
Consequently,  some  alternate  definition  is  necessary  which  will  insure  that  the  weights  vary 
^smoothly”.  One  approach  is  to  consider  the  cross-validated  squared  errors  difference  in 
some  neighborhood  of  the  point  Xy.  Fixing  a  number  /,  using  this  approach  the  weights 
can  be  expressed  as 
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-2 


a 


X;  (Y^-CtiXn,)) 

m€/*(Xm) 


(A.5) 


where  CJ(Xm)  is  the  value  at  the  point  Xm  of  the  least  squares  line  fit  to  the  points 
J^(Xfn)  —  {-X",!*}.  The  smallest  span  has  on  average  the  smallest  difference  between  the 
corresponding  curve  and  the  observations.  This  is  due  to  the  fact  that  the  observation  is 
used  to  form  its  own  prediction,  and  its  relative  influence  increases  as  the  span  decreases. 
Thus  there  is  a  bias  towards  smaller  spans  due  to  overfitting.  Since  the  observations  contain 
noise,  a  small  difference  between  the  smooth  and  the  observations  does  not  imply  a  small 
difference  between  the  smooth  and  the  underlying  curve  /(•)  in  (A.l).  Consequently  a  bias 
towards  small  spans  does  not  guarantee  a  good  overall  fit.  Cross-validation  is  a  method  of 
getting  rid  of  the  bias. 


Such  a  smoothing  technique  was  employed  in  the  P«ACE  procedure  demonstrated  in 
this  work.  A  set  of  three  fixed  span  local  linear  smoothers  were  implemented  with  spans 
10%  ,  20%  and  50%  of  the  number  of  observations  and  the  number  I  used  to  determine  the 
neighborhood  size  of  (A.5)  was  set  to  10%  of  the  number  of  observations. 

There  exist  simple  updating  formulas  for  the  local  linear  smoothers  which  allow  an 
order  n  algorithm  for  the  computation  of  the  fixed  span  smoother.  The  calculations  of  the 
weights  of  (A.5)  can  be  incorporated  as  part  of  the  local  linear  smoother. 

The  P.ACE  algorithm  as  implemented  requires  in  addition  to  the  smooth  function 
some  estimate  of  its  derivative.  The  local  linear  smoother  uses  least  square  lines  to  define 
the  smooth  at  a  point  Xy.  Associated  with  the  line  is  a  slope  my  which  can  be  used  as  an 
estimate  of  the  slope  of  the  curve  at  ATy.  For  each  span  at  each  observation  Xj  the  slope 
of  the  least  squares  line  is  stored.  A  convex  combination  of  the  slopes  at  each  observation 
point  using  the  weights  defined  by  (A.5)  is  then  formed.  Although  this  would  seem  to  yield 
a  reasonable  estimate  of  the  derivative  practice  has  shown  that  the  resulting  curve  tends 
to  contain  high  frequency  components.  Consequently,  to  dampen  these  components  a  local 
linear  smoother  with  span  40%  of  the  observations  is  applied  to  the  convex  combination 
and  the  resulting  curve  is  taken  as  the  estimate  of  the  derivative. 


Bibliography 


[  1]  Aickin,  M.  (1983)  Linear  Statistical  Analysis  of  Discrete  Data,  New  York: 

Wiley  k  Sons 

[2]  Barlow,  Bartholemew,  Bresmer  and  Bronk  (1972)  Statistical  Inference  Under 

Order  Restrictions,  New  York:  Wiley  Sc  Sons 

[3]  Bartlett  M.  S.  (1935)  “Contingency  Table  Interactions” ,  Journal  of  the  Royal 

Statistical  Society  Supplement,  2,  248  —  252 

[4]  Berkaon,  J.  (1955)  “Maximum  likelihood  and  minimiun  chi-square  estimates 

of  the  logistic  function”,  Journal  of  the  Applied  Statistical  Association,  50, 
130  —  162 

[  5]  Bhapker.  V.  P.  and  Koch,  6.  6.  (1968)  “Hypothesis  of  ‘No  Interaction’  in 
Multidimensional  Contingency  Tables” ,  Tecbnometrics,  10,  107  —  123 

[  6]  Birch,  Y.  M.  M.  (1963)  “Maximum  Likelihood  in  Three-Way  Contingency 

Tables”,  Journal  of  the  Royal  Statistical  Society,  series  B,  25,  220  —  233 

[7]  Bishop,  Y.  M.  M.  (1969)  “Full  contingency  tables,  logit,  and  split  contingency 

tables.”.  Biometrics,  25,  383  —  400 

[  8]  Bishop,  Y.  M.  M.  (1971)  “Effects  of  Collapsing  Multidimensional  Contingency 

Tables”,  Biometrics,  27,  545  —  562 

[9]  Box,  G.E.P.,  and  Cox,D.R.  (1964)  “An  analysis  of  transformations” ,  Journal 

of  the  Royal  Statistical  Society,  series  B,  26,  211  —  252 

[10]  Breiman,  L.  and  Friedman,  J.  (1982)  “Estimating  optimal  Transformation 

for  multiple  regression  and  correlation”.  Department  of  Statistics,  Stanford 
University, Tech.  Report  ORION  06 

[11]  Cleveland,  W.  S.  (1979)  “Robust  locally  weighted  regression  and  smoothing 

scatterplots”.  Journal  of  the  American  Statistical  Association,  74,  828  —  836 


Bibliography :  74 


Craig.  C.  C.  (1953)  “Combination  of  Neighboring  Cells  in  Contingency  Ta¬ 
bles”,  Jouma/ of  the  American  Statisticai  Association,  48,  104  —  112 

Deming.  W.  E.  and  Stephan.  F.  F.  (1940)  “On  a  Least  Squares  Adjustment 
of  a  Sampled  JVequency  When  the  Expected  Marginal  Totals  are  Known”,  An¬ 
na/s  of  Mathematical  Statistics,  11,  427  —  444 

Fienberg.  S.  E.  (1970)  “Quasi-independence  and  maximum  likelihood  esti¬ 
mation  in  incomplete  contingency  tables”.  Journal  of  the  American  Statistical 
Association,  66,  1610  —  1616 

Fienberg.  S.  E.  (1980)  The  Analysis  of  Gross-GIassiSed  Categorical  Data 
(second  addition^,  Cambridge  Massachusettes:  The  MIT  Press 

Fienberg.  S.  E.  and  Holland.  P.  W.  (1973)  “Simultaneous  estimation  of 
multinomial  cell  probabilities”,  Journal  of  the  American  Statistical  Association, 
68,  683  —  691 

Fisher,  R.  (1938)  Statistical  methods  for  Research  Workers  (tenth  addition), 
Edinburgh:  Oliver  and  Boyd 

Friedman,  J.  (1984)  “A  Variable  Span  Smoother” ,  Department  of  Statistics, 
Stanford  University, Tech.  Report  5 

Friedman.  J.  and  Oven,  A.  “in  progress” 

Friedman.  J.  and  Tibshirani,  R.  (1983)  “The  monotone  smoothing  of  scat- 
terplots”.  Department  of  Statistics,  Stanford  University, Tech.  Report  ORION 
23 

Goodman.  L.  A.  (1968)  “The  analysis  of  cross-classified  data:  Independence, 
quasi-independence,  and  interactions  in  contingency  tables  with  or  without  miss¬ 
ing  entries”,  Journal  of  the  American  Statistical  Association,  68,  1091  — 
1131 

Goodman.  Leo  A.  (1971)  “The  analysis  of  Multidimensional  Contingency 
Tables:  Stepwise  Procedures  and  Direct  ESstimation  Methods  for  Building 
Models  for  Multiple  Classification”,  Technometrics,  IS,  33  —  61 


Bibliography :  75 


Kendall,  H.  6.  and  Stuut,  A  (1977)  The  Advanced  Theory  of  Statistics 
(fourth  edition),  London:  C.  Griffin 

Lancaster  H.  D.  (1951)  “Complex  contingency  tables  treated  by  the  partition 
of  chi-square”,  Journal  of  the  Royal  Statistical  Society,  series  B,  18,  242  — 
249 

Pearson,  K  (1900)  “On  a  criterion  that  a  given  system  of  deviations  from 
the  probable  in  the  case  of  a  correlated  system  of  variables  is  such  that  it  can 
be  reasonably  supposed  to  have  arisen  from  random  sampling”.  Philosophical 
Magazine,  50,  157  —  175 

Quetelet,  A.  (1827)  “Recherches  sur  la  Population,  les  Naissances,  etc.,  dans  le 
Royaume  des  Pays-Bas” ,  Nouveaux  Mimoires  de  VAcadSmie  Royale  des  Sciences 
et  Belles-Lettres  de  Bruxelle,  4,  117  —  174 

R4nyi,  A.  (1959)  “On  measiires  of  dependence”,  Acta  Math.  Acad.  Sci. 
Hung.,  10,  441  —  451 

Roy,  S.  N.  and  Rastenbaum,  M.  A.  (1956)  “On  the  hypothesis  of  no ‘inter¬ 
action’  in  a  multiway  contingency  table”.  Annals  of  Mathematical  Statistics, 
27,  749  —  757 

Simonoll,  J.  S.  (1983)  “A  penalty  function  approach  to  smoothing  large 
sparse  contingency  tables”.  Annals  of  Statistics,  11,  208  —  218 

United  Nations  (1982)  Demographic  Yearbook  -  1980,  New  York:  United 
Nations  Publication 

Young,  F.  W.  (1981)  “Quantitative  analysis  of  qualitative  data” ,  Psychome- 
trika,  46,  357  —  388 

Yule,  G.  U.  (1900)  “On  the  association  of  attributes  in  statistics:  with  illus¬ 
trations  from  the  material  of  childhood  society,  &c.” ,  Philosophical  Jkansaction 
of  the  Royal  Society,  Series  A,  194,  257  —  311 


